1 Introduction

Following surgery, breast cancer patients often receive adjuvant therapy for at least 5 years as a precaution against relapse. However, many patients experience therapeutic failure marked by disease recurrence, in the form of local and regional relapses or distant metastasis (Carlson 2010). These recurrent lesions often occur during the period of adjuvant therapy, indicating that residual tumor cells respond weakly or not at all to the therapy (Carlson 2010). If recurrence does not happen during adjuvant therapy, later recurrences tend to be sporadic (Carlson 2010; Brewster et al. 2008). These sporadic recurrences are believed to result from tumor cells exiting dormancy over time. While late recurrence indicates some level of response to adjuvant therapy, early breast cancer recurrence poses a serious threat to patients’ lives. As such, methods that use disease attributes collected at the time of initial diagnosis to predict whether a breast cancer patient will develop early recurrence could prove very useful for determining prognosis and guiding clinical decisions.

A widely explored approach to developing prediction models is to calculate an arbitrary prognostic score from multivariate regression models, using disease characteristics, immunohistochemistry and gene expression profiles, alone or in combination (Campbell et al. 2010; Galea et al. 1992; Zhang et al. 2013; Barton et al. 2012; Parisi et al. 2010). While these regression models are well curated, they have several limitations. They treat every case in the same manner, even when dealing with highly heterogeneous populations. Moreover, the performance of a regression model depends on careful selection of relevant disease characteristics, which requires extensive prior knowledge. In addition, it is not always possible to interpret the contribution of individual characteristics from the mathematical formula describing the model. Finally, regression models yield either a score related to an outcome or a probability of an outcome, rather than the outcome itself. To overcome these limitations, we take an alternative approach and build a Decision Tree classifier. The classifier groups patients based on similar disease attributes and outcomes, lists the disease attributes in a hierarchical order based on their relevance to the outcomes, and predicts whether or not a patient will develop early breast cancer relapse.

The principle of a Decision Tree algorithm is to recursively partition a group of heterogeneous examples, using the values of several descriptors (feature attributes), to obtain subgroups that are homogeneous with respect to pre-defined classes (class attribute) (Lee and Hsu 1990; Quinlan 1993). Dividing a group on a feature attribute yields at least two subgroups that are relatively more homogeneous than the parent group. The resulting decrease in system disorder (entropy) can be calculated using a probability-based formula and is denoted Information Gain. Partitioning the examples on a feature attribute with higher Information Gain results in a better-organized system with respect to the class attribute. As such, Information Gain serves as a criterion for evaluating the relevance of individual feature attributes to the class attribute (Quinlan 1993; Mitchell 1997). The algorithm starts by partitioning the examples on the feature attribute that yields the largest Information Gain, repeats the process within each subgroup, and stops when a subgroup is homogeneous or when the Information Gain of the remaining attributes falls below a certain threshold (Lee and Hsu 1990; Quinlan 1993). This results in a tree-like structure with the feature attributes shown as the “branches” and the subgroups shown as the “leaves”. By tracing the feature attributes of an incoming example through the tree, one can predict the class attribute of that example.
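As a minimal illustration of the entropy and Information Gain calculation described above, the following Python sketch scores two candidate feature attributes against a class attribute. The toy values and the attribute names `node_positive` and `uninformative` are hypothetical and are not taken from the study dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the parent group minus the size-weighted entropy of its subgroups."""
    total = len(labels)
    subgroups = {}
    for value, label in zip(feature_values, labels):
        subgroups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in subgroups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = relapse, 0 = disease-free.
relapse = [1, 1, 1, 0, 0, 0, 0, 0]
node_positive = [1, 1, 1, 1, 0, 0, 0, 0]  # separates the classes fairly well
uninformative = [0, 1, 0, 1, 0, 1, 0, 1]  # barely related to the outcome

print(round(information_gain(node_positive, relapse), 3))
print(round(information_gain(uninformative, relapse), 3))
```

In this toy example the first attribute produces purer subgroups and therefore a larger Information Gain, so a tree-growing algorithm would split on it first.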

We were particularly interested in using a Decision Tree classifier to study whether stroma percentage and TGFβ signaling biomarkers are relevant to early breast cancer recurrence. A growing number of studies using regression models show that these factors have different or even contrasting associations with breast cancer recurrence in different subgroups of patients. As such, their implications in breast cancer pathology are context-dependent. However, these contexts remain to be defined in a systematic manner. By grouping patients based on similar outcomes and disease characteristics, a Decision Tree classifier is capable of achieving this goal.

Stroma percentage in tumor core is an emerging prognostic indicator for several types of cancer (Gujam et al. 2014; Downey et al. 2014; Huijbers et al. 2013; Moorman et al. 2012; de Kruijf et al. 2011). In breast cancer, the prognostic value of stroma percentage is context dependent. While high stroma percentage is associated with shorter times of relapse-free survival and overall survival in triple negative breast cancer patients (Moorman et al. 2012), it is associated with longer times of relapse-free survival and overall survival among patients with ER+ breast tumors (Downey et al. 2014). In a mixed population of various subtypes, intra-tumor stroma loses its prognostic value, as determined by a multivariate analysis (Ahn et al. 2012), likely because this method fails to highlight differences within a highly heterogeneous population.

The canonical TGFβ/Smad pathway is also implicated in breast cancer pathology in a context-dependent manner (Massague 2008; Lebrun 2012). In normal mammary gland and in early-stage, low-grade breast carcinomas, TGFβ functions to maintain homeostasis, and this effect is largely due to its growth-inhibitory and pro-apoptotic functions (Mazars et al. 1995). However, in advanced-stage breast tumors, TGFβ promotes aggressive behaviors such as cell migration, cell invasion and homing to distant metastatic sites (Muraoka et al. 2002; Padua et al. 2008). Binding of the TGFβ ligand to its two serine/threonine kinase receptors results in the recruitment and subsequent activation of specific downstream signaling molecules, called Smads (Smad2, 3 and 4), which then translocate to the nucleus to regulate gene transcription (Shi and Massague 2003).

The Decision Tree classifier that we generated identifies the status of lymph node involvement, intra-tumor stroma percentage, and the percentages of tumor cells expressing components of TGFβ-Smad signaling as highly relevant to the status of early breast cancer relapse. It is estimated to be about 70% accurate, and it correctly predicted the outcome for 55 out of 65 patients in an independent validation dataset.

2 Materials and methods

2.1 Dataset

The dataset contained the following types of information on 574 patients with non-metastatic invasive breast cancer who underwent surgery at the Leiden University Medical Center: age, pathological grade, TNM (tumor, node, metastasis) stage, local and systemic therapy, recurrence status (local, regional and distant), time of recurrence following initial treatment and overall survival. Tumor cores were stained with Haematoxylin and Eosin (H&E), and the percentage of intra-tumor stroma was scored by two investigators. In addition, the percentages of cells expressing the following factors were determined by standard immunohistochemistry procedures: estrogen receptor (ER), progesterone receptor (PgR), human epidermal growth factor receptor 2 (HER2) and Ki-67. A tissue microarray (TMA) was constructed from tumor cores of these patients, subjected to immunohistochemistry (IHC) and scored for the percentages of tumor cells expressing the following factors: TGFβ type I and type II receptors (TGFβRI and TGFβRII, respectively), nuclear Smad4 and nuclear phospho-Smad2.

Details on the patient cohort, the methods of stroma percentage scoring and the materials and methods of IHC are reported in previous studies (de Kruijf et al. 2011, 2013; Dekker et al. 2013). These procedures are in accordance with the REporting recommendations for tumour MARKer prognostic studies (REMARK) (McShane et al. 2005). Names and brief descriptions of the attributes used are included in Table 1.

Table 1 A list of 55 attributes used as inputs of the Decision Tree

2.2 Decision tree

We defined the class attribute as the status of breast cancer recurrence in the first 3 years after diagnosis (disease free or tumor recurred). We arbitrarily chose this endpoint on the grounds that patients who relapse within this period had minimal benefit from adjuvant therapy. Their disease outcomes therefore help to predict which patients are unlikely to respond to adjuvant therapy.

We used 55 breast cancer disease characteristics as feature attributes (Table 1). Most of them are well-established disease characteristics, such as pathological grades, clinical stages and expression of molecular markers, that physicians worldwide use to describe breast tumors and form treatment plans. We also included several characteristics whose roles in breast cancer recurrence are controversial, as determined by linear regression methods. These characteristics include stroma percentage in tumor core and percentage of cells expressing TGFβ signaling components.

We used RapidMiner 6.0 to implement the Decision Tree. RapidMiner’s Decision Tree operator is derived from Quinlan’s C4.5 Decision Tree (Quinlan 1993). We chose to rank attributes by Information Gain Ratio, a modification of Information Gain that normalizes the Information Gain of each attribute to minimize bias towards attributes containing large numbers of unique values (distinctive yet non-relevant information) (Mitchell 1997). We set the minimum size for a split to 4, the minimum leaf size to 2, and the minimum gain ratio for splitting on an attribute to 0.1. We grew the tree to a maximum depth of 10 and applied post-pruning.
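The normalization behind the gain ratio can be sketched as follows. This is a generic C4.5-style calculation written in Python for illustration, not RapidMiner's implementation. The toy data, the `patient_id` attribute and the constant `MIN_GAIN_RATIO` are hypothetical, with the constant merely mirroring the 0.1 threshold stated above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information Gain divided by the entropy of the split itself."""
    total = len(labels)
    subgroups = {}
    for value, label in zip(feature_values, labels):
        subgroups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(
        len(group) / total * entropy(group) for group in subgroups.values()
    )
    split_info = entropy(feature_values)  # large when the attribute has many unique values
    return gain / split_info if split_info > 0 else 0.0


MIN_GAIN_RATIO = 0.1  # assumed to correspond to the minimum gain ratio setting

# Hypothetical toy data: 1 = relapse, 0 = disease-free.
relapse = [1, 1, 1, 0, 0, 0, 0, 0]
patient_id = list(range(8))               # unique per patient, clinically meaningless
node_positive = [1, 1, 1, 1, 0, 0, 0, 0]  # binary, partly predictive

for name, values in [("patient_id", patient_id), ("node_positive", node_positive)]:
    ratio = gain_ratio(values, relapse)
    print(name, round(ratio, 3), "eligible" if ratio >= MIN_GAIN_RATIO else "rejected")
```

A unique identifier splits the toy cohort into perfectly pure subgroups and would win on raw Information Gain, but dividing by the split entropy ranks it below the binary attribute.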

2.3 Estimation of accuracy

To estimate the accuracy of the Decision Tree classifier during model building, we coupled the model building process with 2 different resampling validation methods: 10-fold bootstrapping validation with a 0.9 sampling ratio and 10-fold cross validation with stratified sampling. The two methods are comparable in that each validation round uses 90% of the available data to build a model and the remainder to test its accuracy. Results are reported as the estimated accuracy with the standard deviation across the 10 slightly different models.
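The stratified 10-fold scheme can be illustrated with a short Python sketch. It shows only how examples might be assigned to folds so that each fold keeps a similar proportion of relapse cases; it is not RapidMiner's validation operator, and the function names, the random seed and the toy label vector are assumptions made for the example.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign example indices to k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_rounds(labels, k=10, seed=0):
    """Yield (train, test) index lists; each fold is held out exactly once."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            index
            for number, fold in enumerate(folds)
            if number != held_out
            for index in fold
        )
        yield train, test


# Hypothetical cohort of 100 patients, 23 of whom relapsed.
labels = [1] * 23 + [0] * 77
rounds = list(train_test_rounds(labels))
for train, test in rounds[:3]:
    print(len(train), len(test), sum(labels[i] for i in test))
```

Each round would train a tree on the `train` indices and score it on the `test` indices; the mean and standard deviation of the 10 scores give the estimated accuracy.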

2.4 Validation after model building

In the model building process, we excluded a dataset of 69 patients with missing Smad4 values (Smad4 null). Because it was excluded from the model building and accuracy estimation processes, this dataset served as an independent validation dataset. We eliminated 4 patients from this cohort, as they died within 3 years of diagnosis without developing disease recurrence, leaving 65 patients. The prevalence rates of early breast cancer relapse in the original cohort, in the cohort that we used to build the classifier and in the Smad4 null cohort were comparable, at 22.47%, 23.4% and 23.08%, respectively. Using the standard truth table, we calculated the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) of the predictions for patients in this dataset.
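For reference, the truth-table quantities named above can be computed as in the following Python sketch. The function name and the ten toy predictions are hypothetical; they are not the study's predictions and do not reproduce Table 3.

```python
def truth_table_metrics(actual, predicted, positive=1):
    """Summarize binary predictions with the standard 2x2 truth table."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1  # relapse correctly predicted
            else:
                fp += 1  # disease-free patient predicted to relapse
        elif truth == positive:
            fn += 1      # relapse missed
        else:
            tn += 1      # disease-free correctly predicted

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of relapsed patients detected
        "specificity": ratio(tn, tn + fp),  # share of disease-free patients detected
        "ppv": ratio(tp, tp + fp),          # share of relapse predictions that were right
        "npv": ratio(tn, tn + fn),          # share of disease-free predictions that were right
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Hypothetical toy data: 1 = relapse, 0 = disease-free.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = truth_table_metrics(actual, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

Sensitivity and PPV share a numerator but use different denominators, so they generally differ and should be reported separately.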

3 Results

3.1 Data pre-cleaning

The original dataset contained missing values for every TGFβ signaling component, due to tissue falling off the slides during the immunohistochemistry process. To maximize the use of real data in the algorithm training process, we did not fill in missing values for nuclear Smad4, the attribute with the most missing values. Instead, we excluded the 69 cases in which it was missing (Smad4 null). We then combined local recurrence, regional recurrence and distant relapse into one status, and defined the time of recurrence as the earliest time at which any of these recurrence events occurred.

We further eliminated 22 patients who did not develop disease recurrence but died of non-breast cancer causes within 3 years. Data pre-cleaning resulted in a dataset of 483 examples with less than 10% missing values for nuclear Smad2 and for TGFβ type I and type II receptors. The missing values of each attribute were then filled with the average of the known values of that attribute.
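The mean-filling step can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, missing values are assumed to be represented as `None`, and the toy column is not study data.

```python
def fill_missing_with_mean(column):
    """Replace None entries with the mean of the known values in the same column."""
    known = [value for value in column if value is not None]
    if not known:
        raise ValueError("cannot impute a column with no known values")
    mean = sum(known) / len(known)
    return [mean if value is None else value for value in column]


# Hypothetical percentages of positive cells for one marker.
marker_percentages = [10.0, None, 30.0, 20.0]
filled = fill_missing_with_mean(marker_percentages)
print(filled)
```

Because the filled value equals the column mean, it leaves the attribute's average unchanged but slightly reduces its variance, which is one reason to limit this step to attributes with few missing values.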

We assigned one of the following attribute types to each of the 55 feature attributes. Numerical attributes contain real-number values. Nominal attributes contain categorical values. Integer attributes, such as clinical and pathological stages, are ordered nominal attributes and therefore also have a numerical nature. For the target attribute (3-year relapse), we assigned a binominal value to each patient: patients who were disease-free received 0, and patients who developed relapse received 1.
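The derivation of the target attribute can be illustrated with the following Python sketch. The function and parameter names are hypothetical, times are assumed to be expressed in years since diagnosis with `None` for events that did not occur, and treating a recurrence at exactly 3 years as early relapse is an assumption. For simplicity the sketch does not distinguish causes of death.

```python
def three_year_relapse_label(local, regional, distant, death=None, horizon=3.0):
    """Return 1 (relapse), 0 (disease-free) or None (patient excluded)."""
    event_times = [t for t in (local, regional, distant) if t is not None]
    earliest = min(event_times) if event_times else None  # first recurrence of any type
    if earliest is not None and earliest <= horizon:
        return 1
    if death is not None and death <= horizon:
        return None  # died within the horizon without recurrence: outcome unknown
    return 0


print(three_year_relapse_label(None, 2.5, 4.0))               # 1
print(three_year_relapse_label(None, None, 5.0))              # 0
print(three_year_relapse_label(None, None, None, death=1.0))  # None
```

Changing `horizon` yields a target attribute for a different endpoint without touching the feature attributes.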

3.2 Performance of the decision tree classifier

We generated a Decision Tree classifier to predict breast cancer recurrence within 3 years of the initial diagnosis, using a patient dataset containing information on clinical diagnosis, pathological diagnosis, stroma percentage and expression of TGFβ signaling components (Table 1). The Decision Tree operator nested within bootstrapping validation or cross validation generated similar tree structures and similar estimated accuracies, even though the sampling methods differed. Table 2 shows the estimated accuracy, estimated sensitivity (class recall) and estimated specificity (class precision) of the classifiers. Furthermore, Bayesian Boosting, which generated 9 additional tree structures in every round of model building to vote for a consensus, did not remarkably improve model accuracy (data not shown). We also found that growing the tree to a depth of 10 was ideal for this dataset. Neither growing the tree deeper nor omitting pruning changed the major structure of the tree (data not shown). Altogether, these results suggest that the classifiers we obtained captured the major properties of the dataset.

Table 2 Estimated accuracy of the decision tree classifier

Figure 1 shows the decision tree validated by cross validation. The classifier presents patients in 66 leaves. Each leaf represents a subset of breast cancer patients with similar disease characteristics. Even though 2 different leaves could have the same patient outcome, each leaf is independent and can be summarized with a distinct subset of attributes. As such, the classifier grouped breast cancer patients into different subsets based on their intrinsic properties.

Fig. 1

Structure of the Decision Tree classifier. A cohort of 483 patients was recursively divided into 66 subgroups, based on intrinsic similarities of their diseases. The branches of the tree show the disease characteristics used to divide the patients. Each subgroup is labeled with the outcome of its patients 3 years after diagnosis: 1 for recurrence and 0 for disease-free

Out of 66 leaves in total, 60 leaves contained only patients with recurrence or only patients without recurrence (no mix), indicating that in most cases, the combined attributes that describe a group of patients were sufficient to predict a definite outcome. Six leaves contained mixed populations of patients, indicating that for these subgroups, additional attributes are required to further distinguish the disease-free and disease-recurred status.

Independent validation of the classifier’s performance was achieved using a set of 65 patients for whom the values for the Smad4 attribute were missing. When the prediction process reached a branch with a missing Smad4 value (or any other missing value), the classifier assigned the consensus result of all lower branches as the final prediction. Interestingly, the classifier predicted correctly for 55 out of the 65 patients, an accuracy of 85%. Table 3 summarizes the predictions, sensitivity, specificity, likelihood ratios and predictive values. For the disease-relapsed status, the classifier achieved 40% sensitivity (95% CI: 16.43% - 67.67%). For the disease-free status, the classifier achieved 92% specificity (95% CI: 80.75% - 97.73%). These values are notably higher than the prevalence of early relapse (23.08%) and the percentage of disease-free patients (76.92%), respectively. These results suggest that the classifier was capable of distinguishing disease outcomes for most patients in the independent validation set.
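One way to read the missing-value rule described above is sketched below in Python: when the attribute tested at a node is missing, the prediction is the majority class among all training patients in the leaves beneath that node. This is an interpretation for illustration, not RapidMiner's documented behavior. The toy tree, its thresholds, the leaf counts and the attribute name `smad4` are hypothetical; only `pN2` is an attribute name used in this study.

```python
from collections import Counter

# Hypothetical toy tree. Internal nodes test one attribute against a threshold;
# leaves store the class counts (0 = disease-free, 1 = relapse) of training patients.
TREE = {
    "attribute": "pN2",
    "threshold": 0.5,
    "low": {"counts": {0: 11, 1: 0}},
    "high": {
        "attribute": "smad4",
        "threshold": 50.0,
        "low": {"counts": {0: 2, 1: 6}},
        "high": {"counts": {0: 9, 1: 3}},
    },
}


def leaf_counts(node):
    """Pool the class counts of every leaf at or below this node."""
    if "counts" in node:
        return Counter(node["counts"])
    return leaf_counts(node["low"]) + leaf_counts(node["high"])


def predict(node, patient):
    """Follow the tree; stop early and take the pooled majority if a value is missing."""
    while "counts" not in node:
        value = patient.get(node["attribute"])
        if value is None:
            break  # attribute missing: fall back to the consensus of lower branches
        node = node["low"] if value <= node["threshold"] else node["high"]
    counts = leaf_counts(node)
    return max(counts, key=counts.get)


print(predict(TREE, {"pN2": 1, "smad4": 30.0}))  # 1
print(predict(TREE, {"pN2": 1, "smad4": None}))  # 0
```

In the toy tree, a node-positive patient with a low value follows the branch dominated by relapses, whereas the same patient with a missing value receives the pooled majority of both sub-branches.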

Table 3 Predictions for the Smad4 null dataset (top) and a truth table showing the performance of this prediction (bottom)

3.3 Pathological nodal stage, stroma percentage and TGFβ signaling are predictive attributes of early breast cancer recurrence

Among the 55 attributes that we used, the Decision Tree classifier selectively presented 20 disease attributes on 9 levels. These attributes are marked with an asterisk (*) in Table 1. The structure captured several well-documented traits of breast cancer recurrence. The first attribute used to divide patients was the status of lymph node involvement (pN2), highlighting lymph node positivity as the attribute most relevant to early breast cancer recurrence. The classifier split patients into 2 groups, defined as pN2 = 0 (no spread to lymph nodes) and pN2 = 1 (all patients with lymph node involvement, regardless of the level of involvement). This is highly consistent with clinicians’ emphasis on lymph node involvement when making a prognosis for breast cancer recurrence.

In addition, we observed that stroma percentage (Fig. 1, stroma_perc) was the only secondary attribute appearing on both branches, following the division based on pathological node status. This indicates that, alongside lymph node status, stroma percentage was a highly relevant attribute for all cases. For both branches, the classifier divided patients into multiple groups based on stroma percentage, suggesting that tumor-stroma interaction levels define different subgroups of breast tumors with respect to early breast cancer relapse. Notably, the classifier identified a subgroup of 11 disease-free patients who had no lymph node involvement (pN2 = 0) and low stroma percentage (stroma_perc = 0% or 10%). This is consistent with the notion that patients with low-grade, well-encapsulated tumors tend not to develop early relapse (Esposito et al. 2009).

Aside from lymph node status and stroma percentage, the classifier also highlighted several characteristics commonly used in the clinic for defining breast cancer subtypes and prognosis as being determinants of early breast cancer relapse status. These include the Estrogen Receptor α (ER_percpos), Progesterone Receptor (PgR_percpos), HER2, Ki67 (Ki67_Mean) and clinical tumor stage (CTstag) (Table 1 and Fig. 1). In multiple branches of the tree structure, we also found TGFβ receptors (type I and type II) as well as nuclear Smad4 and phospho-Smad2. In particular, TGFβ receptor II and Smad4 were the third-level attributes of their respective branches. These results highlight the subgroup-specific prognostic values of TGFβ signaling components. Equally important, these results also indicate that TGFβ signaling components are better attributes than many of the commonly used clinical criteria (those not shown in the tree, Table 1) for predicting early breast cancer relapse.

4 Discussion

In this study, we took a data mining approach to generate a Decision Tree classifier that can predict breast cancer relapse status within the first 3 years following diagnosis. The tree classified patients into disease-free or disease-relapsed categories. The tree subdivided patients using disease characteristics that met a defined relevance threshold for disease recurrence (Information Gain Ratio = 0.1), and listed these characteristics in hierarchical order. As such, the model building process was also a “feature selection” process that helped identify important disease characteristics.

The classifier identified pathological nodal status as the feature most relevant to disease recurrence. We supplied the algorithm with 3 different ways of categorizing lymph node status: pN2 (binary attribute of lymph node involvement), pNstag2 (integer attribute denoting pN0, pN1, pN2, pN3 and pNx) and pNstag (integer attribute further subdividing each pNstag2 stage). Of the three, the Decision Tree classifier identified only pN2 as relevant to early breast cancer relapse. This indicates that lymph node involvement is relevant to predicting early breast cancer relapse independently of the number of nodes involved. This is also highly consistent with the longstanding notion that pathological lymph node status is the most significant predictor of breast cancer recurrence (Aubele et al. 1995). As such, this finding supports the capacity of the Decision Tree classifier to identify and hierarchically present important features in our dataset.

Stroma percentage appeared as the only second-level attribute of all branches, while TGFβ signaling components appeared in various branches on lower levels. Current literature suggests that stroma percentage and TGFβ signaling components are relevant to breast cancer recurrence, but their predictive values differ, or even contrast, depending on the context. The Decision Tree classifier not only identified these attributes as highly relevant, but also provided a detailed description of the individual contexts.

With respect to the model performance, the classifier achieved over 80% precision for predicting a disease-free status, but only 34.15% recall for predicting early recurrence. This suggests that additional attributes are needed to better describe patients with early recurrence. Potentially, including immunohistochemistry scores of additional oncogenic or tumor suppressive signaling pathways, such as those of PI3K-AKT-mTOR, EGFR, p53 or Rb, could improve the classifier.

Nevertheless, the performance is comparable to, and potentially better than, that of existing methods. The Decision Tree classifier predicted correctly for 85% of the patients in the Smad4 null independent validation set. In particular, 40% of the patients predicted to have early relapse within 3 years indeed had relapse. By comparison, the Breast Cancer Index (BCI), a well-curated method to predict outcomes for ER+, lymph node negative (LN-) patients, classifies patients into 3 groups of increasing risk of distant recurrence using a combination of the HOXB13:IL17BR gene expression ratio and the molecular grade index (Jerevall et al. 2011; Ma et al. 2008). In 2 different patient cohorts, the estimated percentages of patients who were classified into the high-risk group by BCI and developed distant relapse within 5 years were 2.6%–21% and 14.6%–33.3%, respectively (Zhang et al. 2013). BCI and the Decision Tree classifier have different advantages. BCI is capable of predicting distant relapse and overall survival for various endpoints, but only for ER+, LN- patients. The Decision Tree classifier can be applied to all types of patients, but as currently designed it predicts only 3-year relapse. However, predicting other endpoints can be easily done, as it only requires creating a new target attribute for that endpoint. As such, the Decision Tree classifier could potentially be a powerful prognostic tool. In particular, the classifier can be readily adopted in different academic and clinical settings, as the attributes that we used are empirical and easy to assess. All nominal attributes, such as stage and grade, are assessed based on established quantitative methods in clinical practice at the time of diagnosis. All numeric attributes are established from quantitative immunohistochemistry staining.

In summary, we generated a Decision Tree classifier that hierarchically organizes breast cancer disease characteristics based on their relevance to early breast cancer relapse. One can easily trace down the tree structure to obtain the description of the intrinsic similarity of each subgroup of patients. The classifier also highlights the prognostic values of pathological nodal status, stroma percentage and TGFβ signaling components. To our knowledge, this is the first Decision Tree model that utilizes standardized disease characteristics that can be easily obtained by different clinics.