Introduction

Breast cancer is a leading cause of death for women worldwide [1]. When a breast cancer patient is faced with the decision of whether to undergo surgery, it is of immense help to have knowledge of the expected survival time (beyond surgery) as a function of the patient’s age and features of the primary tumor, such as size, grade, and whether it has spread to one or multiple nodes. Such knowledge can significantly aid in managing the patient’s quality of life along with choosing among postoperative treatment options. Kaplan–Meier (KM) theory [2] has long been the standard nonparametric method of computing survival probability curves from right-censored (or uncensored) data and has been employed by many past studies on estimating effects of cofactors such as tumor features (size, grade, node spread status) and patient age at the time of surgery on post-surgery survival chances [3,4,5]. However, accurate practical application of the method requires a large dataset, comprising tens of thousands to hundreds of thousands of subjects studied over a long period of time [6, 7]. A moderately large dataset (hundreds to a few thousand subjects) segmented into narrow ranges of patient age, tumor diameter, specific grade, and node status quickly becomes too sparse for reliable application of KM.

In this work we address the above challenge by categorizing only the age variable while keeping tumor diameter and time of survival as continuous variables. The post-surgery survival probability of a patient is expressed as a parameterized function of the two continuous variables, i.e., (1) tumor diameter at the time of surgery, and (2) the amount of time (number of years) beyond surgery, with the parameter values being dependent on the categorical cofactors (patient age group, tumor grade, and node status at the time of surgery). Optimized parameters are determined by representing the patient-to-patient distribution of diameters with a smooth right-skewed function [8], and then matching the resulting diameter-marginalized survival function with the Kaplan–Meier mean survival curve computed from clinical data. The procedure yields the survival probability of a patient as a function of time after surgery, and cofactors such as age group, tumor diameter, and node status.

Finally, we show that the observed fraction of patients with positive lymph nodes (i.e., with metastatic spread to at least one of the nodes) can be quantitatively explained by a simple tumor growth and metastasis model. Using two different tumor growth models from the literature and assuming metastasis rates proportional to tumor surface area, we utilize the observed fraction of node-positive patients in our clinical dataset to determine the metastasis rate constant and the distribution of its patient-to-patient variation.

Data

In this work, we analyze the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) breast cancer dataset on around 2000 patients [9]. This dataset has been the subject of several recent studies, ranging from genomic architecture [9] and signaling pathways [10] to cancer prognosis using machine learning [11] and evolutionary modeling of breast cancer [12]. For our own purposes, we focus on six variables: patient age, tumor diameter, survival status (censored or uncensored), survival time, neoplasm histologic grade (henceforth referred to simply as tumor “grade”), and the Nottingham Prognostic Index (NPI). Tumor grade in this dataset is a three-level categorical variable defined as either 1 (well-differentiated, more benign), 2 (moderately differentiated), or 3 (poorly differentiated). Node status (“node”) was also included as a three-level categorical variable defined as node = 1 (no nodes affected by metastatic spread), node = 2 (1–3 nodes affected), or node = 3 (more than 3 nodes affected). The NPI is the sum of node, grade, and a small contribution proportional to the tumor diameter. Thus, given the NPI, grade, and diameter, it is straightforward to determine the node status for each patient.
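As a minimal sketch of how node status can be backed out of the NPI, the snippet below assumes the standard definition NPI = 0.2 × (tumor size in cm) + node + grade. The text only states that the diameter term is small and proportional to size, so the 0.2 coefficient and the function name are assumptions made for illustration.

```python
def node_status_from_npi(npi, grade, diameter_mm):
    """Recover node status (1, 2, or 3) from the Nottingham Prognostic Index.

    Assumes NPI = 0.2 * size_cm + node + grade (standard definition; the
    coefficient is not stated explicitly in the text).
    """
    size_cm = diameter_mm / 10.0
    # Rounding absorbs floating-point noise and NPI values stored with few decimals.
    node = round(npi - grade - 0.2 * size_cm)
    if node not in (1, 2, 3):
        raise ValueError(f"inconsistent inputs: implied node status {node}")
    return node


# Hypothetical patient: NPI 4.5, grade 2, 25 mm tumor.
print(node_status_from_npi(npi=4.5, grade=2, diameter_mm=25.0))
```

The range check is a cheap way to flag records where NPI, grade, and diameter do not agree with one another.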

After removing patients with incomplete data, we ended up with 1875 patients in our dataset, which had the following frequency breakdown: node = 1 (968), node = 2 (603), node = 3 (304); grade = 1 (167), grade = 2 (763), and grade = 3 (945). Figure 1 summarizes the distribution of tumor diameter and (uncensored) survival times for the different node and grade categories. From the boxplots of Fig. 1a and b, we can see a positive association between diameter size and node status and a negative association between survival times and node status, i.e., with higher levels of metastatic spread to the nodes the diameter distribution changes to larger sizes and survival times shift to smaller values, respectively. Also, the probability of survival as a function of tumor grade is qualitatively similar to the probability of survival as a function of node status, although there are some quantitative differences, as discussed in the sections below.

Fig. 1

Exploratory data analysis of the METABRIC dataset. Boxplot representation of the distribution of tumor diameter and survival times (uncensored) for the three levels of node status and tumor grades

Kaplan–Meier survival

From the censored survival data, it is straightforward to compute the survival probability (along with uncertainty estimation) using KM theory. The mean KM survival probability for the whole dataset of 1875 patients, as well as for subsets segmented according to node status, age group, and tumor grade, is summarized in Fig. 2. The uncertainty in the estimate of overall survival is represented by the 95% confidence bounds in Fig. 2a. Such uncertainty is known to grow with survival time and to scale inversely with the square root of the number of patients at risk [2]. Thus, for categories with a low number of samples, the uncertainty in the KM estimate can be significantly higher than the tight bounds indicated in Fig. 2a.
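For readers who want to see the computation spelled out, here is a minimal sketch of the KM product-limit estimator. The standard error uses Greenwood's formula, which is an assumption here because the text does not name the uncertainty estimator it uses. The function name and the toy data are illustrative only.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t), se)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if death was observed (uncensored), 0 if right-censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, greenwood = 1.0, 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect every patient whose follow-up ends at this time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            if n_at_risk > deaths:
                greenwood += deaths / (n_at_risk * (n_at_risk - deaths))
                se = surv * math.sqrt(greenwood)
            else:
                se = float("nan")  # Greenwood term undefined when all at risk die
            curve.append((t, surv, se))
        n_at_risk -= removed
    return curve


# Toy data: five patients, two of them censored.
for t, s, se in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]):
    print(t, round(s, 3), round(se, 3))
```

Each Greenwood term has the number at risk in its denominator, which is why the bounds widen for small subsets and at late times when few patients remain at risk.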

Fig. 2

Mean Kaplan–Meier survival probability for the METABRIC dataset: a whole dataset, b for each level of node status, c for each of the three age groups, d for each level of tumor grade. In a the 95% confidence margins are also included

Prior to discussing survival results, we would like to note that age was categorized into three levels (or age groups), i.e., < 55 (age group 1), 55–65 (age group 2), and > 65 (age group 3), with 611, 496, and 768 patients, respectively. Such choices of groups were based on an exploration of different segmentations and performing logrank tests [13] to decide if statistically significant differences in survival existed among different age groups. For instance, if we segment the data into four age groups, < 50, 50–60, 60–70, and > 70, we find no significant survival difference between the first two age groups, but significant differences among all other pairs. The chosen three-level age categorization in this work was not based on any rigorous statistical procedure, but was rather the result of seeking a segmentation that satisfies: (1) a small number of categories; (2) round values of age boundaries; (3) significant survival difference between each age group pair (by logrank test); and (4) significant patient population in each age group.
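The pairwise comparisons described above can be sketched with a two-group logrank test. This is a minimal textbook version under the usual hypergeometric variance, not necessarily the exact procedure of ref. [13]. The function name and the toy data are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test; returns (chi2, p_value) for 1 degree of freedom."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / var
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy example: group A dies early, group B dies late.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(chi2, p)
```

Running this for every pair of candidate age groups and checking which pairs fall below a chosen significance level mirrors criterion (3) in the list above.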

From the KM curves of Fig. 2b–d several interesting trends were noted. The overall decreasing survival probability with increasing node status, age group, and tumor grade is intuitive, and thus, not surprising. However, some results were unexpected. For instance, Fig. 2c shows that for times less than 8 years, the survival probability is higher for age group 55–65 than for age group < 55. Logrank tests [13] indicate significant survival differences among the three age groups (< 55, 55–65, and > 65), although the difference between the first two age groups is relatively smaller than their difference from age group 3. Such results are consistent with the survival curves of Fig. 2c. Also, as Fig. 2d indicates, the survival differences among the three tumor grades are smaller than those between the node categories. Thus, in the following analysis, dependence on node status has been explored more extensively.

As shown in Fig. 2b–d, the Cox proportional hazard model [14, 15] does not hold as a function of cofactors such as node status, age group, or tumor grade. Figure 3, which plots the hazard function for the three node statuses and the three age groups, supports this finding. Statistical tests of hazard proportionality [16] on the node, grade, and age group cofactors resulted in small p-values in all cases, thereby confirming the non-applicability of the Cox model for this dataset.

Fig. 3

The hazard functions corresponding to the KM survival curves of Fig. 2b and c. The hazard ratio between different curves varies with time, thus showing the inapplicability of the Cox proportional hazard model for cofactors such as node status and age group

Distribution of tumor diameters and the survival function

The main strategy in this work, i.e., fitting the distribution of diameters with a smooth probability distribution, arose from noting that in the METABRIC dataset the diameter distribution for any category segment appears to follow a qualitatively similar right-skewed distribution. This is exemplified in Fig. 4 for four different categories, i.e., (a) the entire dataset of 1875 patients; (b) patients with tumor grade 3; (c) patients of age group 1 (< 55) with node status 1; and (d) patients with node status 3.

Fig. 4

Density histogram of tumor diameters for four different stratifications and the corresponding best fits with the Dagum probability distribution (see Eq. (1) in text)

To fit the right-skewed histograms of Fig. 4 we chose the following three-parameter probability distribution function, originally used by Dagum [8]:

$${\varphi }_{pdf}\left(D\right)=\frac{\left(ap/b\right){\left(D/b\right)}^{ap-1}}{{\left\{{\left(D/b\right)}^{a}+1\right\}}^{p+1}}$$
(1)

Thus, for each category subset, we have a set of Dagum parameters \(a,\,b,\,p\) that best fits the corresponding histogram of diameter distribution, which results in the best fit \({\varphi }_{pdf}\left(D\right)\) for that category.
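As a sanity check on Eq. (1), the sketch below implements the Dagum density together with its closed-form cumulative distribution, \(F(D)={\left[1+{\left(D/b\right)}^{-a}\right]}^{-p}\), and verifies numerically that the density integrates to the CDF. The parameter values are hypothetical and are not fitted values from this study.

```python
def dagum_pdf(d, a, b, p):
    """Dagum density of Eq. (1); d and b in mm, a and p dimensionless."""
    x = d / b
    return (a * p / b) * x ** (a * p - 1) / (x ** a + 1) ** (p + 1)


def dagum_cdf(d, a, b, p):
    """Closed-form Dagum CDF (requires d > 0)."""
    return (1.0 + (d / b) ** (-a)) ** (-p)


def trapezoid(f, lo, hi, steps):
    """Plain trapezoid rule, enough for a smooth one-dimensional density."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
    return h * total


# Hypothetical parameters chosen only to give a right-skewed shape.
a, b, p = 3.0, 20.0, 1.0
area = trapezoid(lambda d: dagum_pdf(d, a, b, p), 0.0, 200.0, 20000)
print(area, dagum_cdf(200.0, a, b, p))
```

Note that the density is only finite at \(D=0\) when \(ap\ge 1\). For \(ap<1\) the integration should start slightly above zero.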

As mentioned in the introduction, the reason for pursuing a smooth diameter distribution in this work is twofold: (1) lack of enough data within an over-specified category with a narrow range of diameters; and (2) resulting increase in the number of model parameters. The idea behind our approach is that each KM survival curve of Fig. 2 is a marginal distribution of a survival function \(\left(S\right)\) of two continuous variables, survival time (t) and tumor diameter (D), that has been marginalized (or integrated) over variable D. Next, we selected a functional form for S as a function of two continuous variables, t and D. To this end, we segmented the dataset according to many different (node, age group, grade, diameter range) classes and explored the patterns of KM survival probability. From such analyses we found that survival probability as a function of t and \(D\) can be modeled by the function:

$$S\left(D,t;\,z\right)={e}^{-k\left(z\right){D}^{n}{t}^{\alpha \left(z\right)}},$$
(2)

where z represents the set of cofactors such as node status, age group, and tumor grade. In the above model, the exponent n is kept independent of the cofactors z, while the parameters k and \(\alpha\) are both z-dependent. The hazard function [17, 18] corresponding to the above model is readily obtained as:

$$h\left(D,t;z\right)=\alpha \left(z\right)k\left(z\right){D}^{n}{t}^{\alpha \left(z\right)-1}.$$
(3)

The z-dependence of the parameter \(\alpha\) (along with marginalization over diameters \(D\) as discussed in the following section) leads to the breakdown of the Cox proportional hazard model, consistent with previous discussions.
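The following sketch encodes Eqs. (2) and (3), checks numerically that the hazard equals \(-\partial \ln S/\partial t\), and shows that the hazard ratio between two cofactor sets drifts with time when their \(\alpha\) values differ. The two \((k,\alpha)\) pairs are hypothetical illustrations and are not taken from Table 1.

```python
import math


def survival(d, t, k, alpha, n=0.8):
    """Eq. (2): survival probability for diameter d (mm) at time t (years)."""
    return math.exp(-k * d ** n * t ** alpha)


def hazard(d, t, k, alpha, n=0.8):
    """Eq. (3): hazard function, i.e. -d ln S / dt."""
    return alpha * k * d ** n * t ** (alpha - 1.0)


# Two hypothetical cofactor sets (illustrative values only).
set_1 = dict(k=0.004, alpha=1.2)
set_2 = dict(k=0.010, alpha=0.9)

d, t, eps = 25.0, 5.0, 1e-5
numeric = -(math.log(survival(d, t + eps, **set_1))
            - math.log(survival(d, t - eps, **set_1))) / (2 * eps)
print(numeric, hazard(d, t, **set_1))

# The hazard ratio scales as t**(alpha_1 - alpha_2), so it is not constant.
for years in (1.0, 5.0, 10.0):
    print(years, hazard(d, years, **set_1) / hazard(d, years, **set_2))
```

For a fixed diameter the ratio is \(\left({\alpha }_{1}{k}_{1}/{\alpha }_{2}{k}_{2}\right){t}^{{\alpha }_{1}-{\alpha }_{2}}\), which is constant only when the two exponents coincide.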

Optimizing parameters and survival as a function of diameter

The optimized survival parameters \(n, k\left(z\right),\) and \(\alpha \left(z\right)\) are obtained by matching the marginalized survival, defined by:

$${S}_{\text{model}}\left(t;z\right)=\int {\varphi }_{pdf}\left(D\right){e}^{-k\left(z\right){D}^{n}{t}^{\alpha \left(z\right)}}dD,$$
(4)

with the corresponding KM survival curves, e.g., as shown in Fig. 2. From numerical experiments, we see that the root-mean-squared error (RMSE) between the KM survival curves computed from the data and the corresponding marginalized survival \({S}_{\text{model}}\) is low for a range of values of the triplet \(\left(n,\,k\left(z\right),\,\alpha \left(z\right)\right)\). Thus, to reduce the number of parameters, we constrain the parameter \(n\) to be of fixed value (independent of cofactors \(z\)) and optimize only \(k\left(z\right)\) and \(\alpha \left(z\right)\) for each cofactor combination of interest. In the analysis below, we have chosen this value to be 0.8, although a slightly different value of \(n\) (e.g., 0.7 or 0.9) would have also yielded comparable results. Table 1 lists the various cofactor sets we have explored in this study and the corresponding optimized survival parameters. Figure 5 displays the results for \({S}_{\text{model}}\) corresponding to the cofactor sets in Fig. 2. These results show good agreement with the KM curves of Fig. 2.
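The matching step can be sketched as follows: Eq. (4) is evaluated by trapezoid integration over the Dagum density, and \((k,\alpha)\) are chosen by a brute-force grid search on the RMSE. The grid search is an assumption made for illustration, since the text does not state which optimizer was used. The Dagum parameters and the synthetic target curve are likewise hypothetical.

```python
import math


def dagum_pdf(d, a, b, p):
    """Dagum density, Eq. (1)."""
    x = d / b
    return (a * p / b) * x ** (a * p - 1) / (x ** a + 1) ** (p + 1)


def diameter_grid(a, b, p, d_max=200.0, steps=400):
    """Trapezoid nodes and normalized weights for averaging over the Dagum pdf."""
    h = d_max / steps
    nodes = [i * h for i in range(steps + 1)]
    weights = [h * dagum_pdf(d, a, b, p) for d in nodes]
    weights[0] *= 0.5
    weights[-1] *= 0.5
    total = sum(weights)  # slightly below 1 because the tail beyond d_max is cut
    return nodes, [w / total for w in weights]


def s_model(t, k, alpha, nodes, weights, n=0.8):
    """Diameter-marginalized survival, Eq. (4)."""
    return sum(w * math.exp(-k * d ** n * t ** alpha)
               for d, w in zip(nodes, weights))


def fit_k_alpha(times, km_values, nodes, weights, k_grid, alpha_grid):
    """Brute-force search for the (k, alpha) pair with the lowest RMSE."""
    best = None
    for k in k_grid:
        for alpha in alpha_grid:
            sq = sum((s_model(t, k, alpha, nodes, weights) - s) ** 2
                     for t, s in zip(times, km_values))
            rmse = math.sqrt(sq / len(times))
            if best is None or rmse < best[0]:
                best = (rmse, k, alpha)
    return best


# Synthetic stand-in for a KM curve, generated from known parameters.
nodes, weights = diameter_grid(a=3.0, b=20.0, p=1.0)
times = list(range(1, 16))
target = [s_model(t, 0.005, 1.1, nodes, weights) for t in times]
k_grid = [0.001 * i for i in range(1, 11)]
alpha_grid = [0.8 + 0.05 * j for j in range(11)]
print(fit_k_alpha(times, target, nodes, weights, k_grid, alpha_grid))
```

In practice `target` would be replaced by the KM estimate for the cofactor set of interest, and a finer grid or a local optimizer could refine the result.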

Table 1 Optimized survival parameters for different cofactor sets (fixed \(n=0.8\)). These parameters are to be used with Eq. (2) to estimate the survival probability for a given tumor diameter \(D\) (in mm) and a given time of survival \(t\) (in years)
Fig. 5

Model-predicted mean survival probability using Eqs. (2) and (4) for the four cases corresponding to Fig. 2. Top left whole dataset, top right for each node status, bottom left for each age group, bottom right for each tumor grade. For direct comparison, top left also includes the KM mean survival curve from Fig. 2a

With the optimized parameters shown in Table 1, Eq. (2) can readily be used to estimate survival probability as a function of D. Figure 6 illustrates such a prediction for example cases. When comparing Fig. 6a and b, we see that the 5-year survival probability for age group 2 is higher than that of age group 1, while the trend reverses for the 15-year survival probability, which is consistent with the survival curves of Fig. 2c and 5c. The relative ordering of the curves for other cofactors is also as expected.

Fig. 6

Model-predicted survival probability as a function of tumor diameter using Eq. (2): a KM5 curves for three age groups, b KM15 curves for three age groups, c KM10 curves for the three nodes, d KM5, KM7, KM10, KM15 curves for node 1, age group 3

As a more direct rationalization of the choice of the optimized parameters in Table 1, we have segmented the diameters into 14 intervals and computed the mean survival for each interval. Figure 7 compares these results (for arbitrarily chosen cases) with predictions from our model survival function (Eq. (2)) using parameters from Table 1. The scatterplots show large fluctuations as a function of diameter, the direct result of a relatively small number of data points within each segment. Additionally, reliable results for larger diameters are absent due to the lack of significant data for D > 80 mm. Nonetheless, the consistency of the model prediction is apparent in each case and it provides confidence in the survival function and the parameters derived above.

Fig. 7

Results from direct calculation of mean KM survival probability for various diameter segmentations within specified cofactor categories (open circles) compared with model-predicted probability (dashed line) using Eq. (2). The four cases have been arbitrarily chosen for illustration purposes. Top left whole dataset for KM 5 year, top right whole dataset for KM 10 year, bottom left node status 2 for KM 10 year, bottom right node status 1 and age group 3 for KM 10 year

Fraction-node-positivity and metastasis rates

To make a quantitative connection between node positivity and metastasis rates, we computed the fraction of patients with at least one positive node (i.e., node status 2 or 3) for small ranges of tumor diameters. The results are shown as data points (open squares) in Fig. 8.

Fig. 8

Fraction-node-positive results for the whole METABRIC dataset. (open squares) Fraction of patients with at least one node affected (i.e. node status 2 or 3) for different diameter ranges directly computed from the METABRIC dataset; (dashed line) best fit using Eq. (5) that was averaged over the joint lognormal distribution of tumor growth and metastasis rates (Eq. (7)). The optimized metastasis parameters are listed in Table 2

Table 2 Optimized parameters for the marginal lognormal distribution of metastasis rate \(m\) obtained by fitting the observed fraction of node-positive patients in the METABRIC dataset as a function of tumor diameter \(D\) (open squares in Fig. 8)

If we assume metastasis to occur homogeneously from the tumor surface [19, 20], the rate of metastasis for a tumor of diameter D should be equal to \(m{D}^{2}\), where \(m\) is a metastasis rate constant. Assuming metastasis to be a Poisson process [21], the total probability of spread to any node during the lifetime of the tumor, i.e., during its entire growth time from size 0 (at the time of inception) to size D (at the time of surgery) is given by:

$$P\left(\text{node}\,\text{positive}\right)=1-{e}^{-{\int }_{0}^{{t}_{1}}m{D}^{2}\left(\tau \right)d\tau },$$
(5)

where “node positive” means that at least one lymph node has been affected by metastatic spread, \(D\left(\tau \right)\) is a function that represents how the tumor grows with time, \({t}_{1}\) is the age of the tumor at the time of surgery, and \(D\left({t}_{1}\right)=D\) is the size of the tumor at the time of surgery. Given that growth rates cannot be obtained from survival data, we need to use growth models developed in the literature. To this end, we considered two different growth models for the primary breast tumor, i.e., a Logistic growth model [22] and a Gompertz growth model [23]. In both these models, the tumor growth rate follows a lognormal distribution with finite standard deviation. Explicitly, the growth models (in terms of diameter) are as follows:

Logistic growth model [22]:

$$D\left(\tau \right)={D}_{\max}{\left[1+\left\{{\left({D}_{\max}/{D}_{cell}\right)}^{3/4}-1\right\}{e}^{-\kappa \tau /4}\right]}^{-4/3}$$
(6a)

Gompertz growth model [23]:

$$D\left(\tau \right)={D}_{cell}{\left({D}_{\max}/{D}_{cell}\right)}^{1-{e}^{-\kappa \tau }}$$
(6b)

In the above, \({D}_{\max}\) is the theoretical maximum diameter and \({D}_{cell}\) the diameter of a single cell (tumor size at time 0). In Eq. (6a), the growth parameter \(\kappa\) follows a lognormal distribution with mean 1.07 and standard deviation 1.14 [22], while in Eq. (6b) the growth parameter \(\kappa\) follows a lognormal distribution with mean −2.9 and standard deviation 0.71 [23]. We would like to note that in Norton’s original paper [23], the time origin \(t=0\) was defined as the time when the tumor reaches a size of \(N\left(0\right)=4.8\times {10}^{9}\) cells, while in our definition (Eq. (6b)) time starts when the tumor consists of a single cell. This translational shift in time does not cause any change in the quantitative interpretation of the growth parameter \(\kappa\) (which Norton calls \(b\)). In the analysis below, we chose \({D}_{\max}=180\) mm and \({D}_{cell}=0.0124\) mm (using a spherical cell volume of \({10}^{-6}\,\text{m}{\text{m}}^{3}\)).
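To make Eqs. (5), (6a), and (6b) concrete, the sketch below implements both growth curves, inverts each to obtain the tumor age \({t}_{1}\) at which the diameter equals \(D\), and evaluates Eq. (5) by trapezoid integration for a single patient. The values of \(m\) and \(\kappa\) in the usage lines are hypothetical, and the time unit of \(\kappa\) is whatever the cited growth model uses.

```python
import math

D_MAX, D_CELL = 180.0, 0.0124  # mm, values quoted in the text


def logistic_diameter(tau, kappa):
    """Eq. (6a): logistic growth expressed in diameter."""
    r = (D_MAX / D_CELL) ** 0.75
    return D_MAX * (1.0 + (r - 1.0) * math.exp(-kappa * tau / 4.0)) ** (-4.0 / 3.0)


def gompertz_diameter(tau, kappa):
    """Eq. (6b): Gompertz growth expressed in diameter."""
    return D_CELL * (D_MAX / D_CELL) ** (1.0 - math.exp(-kappa * tau))


def logistic_age(d, kappa):
    """Invert Eq. (6a): tumor age at which the diameter equals d."""
    r = (D_MAX / D_CELL) ** 0.75
    return -(4.0 / kappa) * math.log(((D_MAX / d) ** 0.75 - 1.0) / (r - 1.0))


def gompertz_age(d, kappa):
    """Invert Eq. (6b): tumor age at which the diameter equals d."""
    frac = math.log(d / D_CELL) / math.log(D_MAX / D_CELL)
    return -math.log(1.0 - frac) / kappa


def p_node_positive(d, m, kappa, growth, age, steps=2000):
    """Eq. (5) for one patient; requires D_CELL < d < D_MAX."""
    t1 = age(d, kappa)
    h = t1 / steps
    vals = [growth(i * h, kappa) ** 2 for i in range(steps + 1)]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return 1.0 - math.exp(-m * integral)


# Hypothetical single-patient values: kappa at the median of the Gompertz
# lognormal quoted in the text, m chosen only for illustration.
kappa, m = math.exp(-2.9), 1e-4
for d in (10.0, 20.0, 40.0):
    print(d, p_node_positive(d, m, kappa, gompertz_diameter, gompertz_age))
```

Passing `logistic_diameter` and `logistic_age` instead gives the corresponding logistic-growth result for the same call.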

The metastasis rate constant \(\left(m\right)\) is also expected to vary from person to person, and we assume a lognormal distribution. Past studies indicate a positive correlation between tumor growth and metastasis rates [24, 25]. Thus, we assume that \(\left(\text{ln}\left(m\right),\text{ln}\left(\kappa \right)\right)\) follows a bivariate normal distribution with some positive correlation coefficient \(\rho\), i.e., \(\left(\text{ln}\left(m\right),\text{ln}\left(\kappa \right)\right)\sim N(\mu ,{\Sigma })\), with the mean vector and covariance matrix given by:

$$\mu =\left[\begin{array}{c}{\mu }_{m}\\ {\mu }_{g}\end{array}\right] \;\text{and}\; {\Sigma }=\left[\begin{array}{cc}{\sigma }_{m}^{2}& \rho {\sigma }_{m}{\sigma }_{g}\\ \rho {\sigma }_{m}{\sigma }_{g}& {\sigma }_{g}^{2}\end{array}\right],$$
(7)

where the subscripts “\(m\)” and “\(g\)” refer to metastasis and growth parameters, respectively.

The growth rate distribution parameters \(({\mu }_{g}, {\sigma }_{g})\) were chosen from literature values of marginal lognormal distributions for logistic growth [22] and Gompertz growth [23]. To determine the metastasis rate distribution parameters \({\mu }_{m}\), \({\sigma }_{m}\) we use the following strategy:

  1. assume a positive value of the growth-metastasis log-log correlation (\(\rho\)) and keep it constant;

  2. choose a specific value of marginal metastasis parameters \(({\mu }_{m}, {\sigma }_{m})\);

  3. draw a large number (1000) of random pairs \(\left(\text{ln}\left(m\right),\text{ln}\left(\kappa \right)\right)\) from a bivariate normal distribution (Eq. (7)), compute \(P\left(\text{node positive}\right)\) as a function of \(D\) (using Eq. (5)) for each drawn pair, average over all 1000 drawings, and compare this average \(P\left(\text{node positive}\right)\) curve to the open squares of Fig. 8;

  4. repeat steps (2) and (3) (for a fixed \(\rho\)) until the distribution-averaged \(P\left(\text{node positive}\right)\) curve has minimum root-mean-square-error (RMSE) with respect to the open squares of Fig. 8.
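The four steps above can be sketched as follows for the Gompertz model. Because Eq. (6b) depends on time only through \(\kappa \tau\), substituting \(u=\kappa \tau\) turns the exponent of Eq. (5) into \(\left(m/\kappa \right)G\left(D\right)\), where \(G\) is computed once per diameter. The sketch uses this to keep the Monte Carlo loop cheap. The grid search, the fixed random seed, the function names, and all numerical values in the usage lines are assumptions made for illustration, and the "observed" fractions are synthetic rather than METABRIC data.

```python
import math
import random

D_MAX, D_CELL = 180.0, 0.0124  # mm, as quoted in the text


def gompertz_area_integral(d, steps=2000):
    """G(d): integral of D(u)^2 over u = kappa * tau from 0 to u1, with D(u1) = d."""
    log_ratio = math.log(D_MAX / D_CELL)
    u1 = -math.log(1.0 - math.log(d / D_CELL) / log_ratio)
    h = u1 / steps
    vals = [(D_CELL * math.exp(log_ratio * (1.0 - math.exp(-i * h)))) ** 2
            for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


def mean_p_node_positive(diams, mu_m, sigma_m, mu_g, sigma_g, rho,
                         n_draws=1000, seed=0):
    """Step 3: average Eq. (5) over draws of (ln m, ln kappa) from Eq. (7)."""
    rng = random.Random(seed)  # same seed for every candidate: common random numbers
    g = [gompertz_area_integral(d) for d in diams]
    totals = [0.0] * len(diams)
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        ln_kappa = mu_g + sigma_g * z1
        ln_m = mu_m + sigma_m * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        ratio = math.exp(ln_m - ln_kappa)  # m / kappa
        for j, gj in enumerate(g):
            totals[j] += 1.0 - math.exp(-ratio * gj)
    return [s / n_draws for s in totals]


def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def fit_metastasis(diams, observed, mu_g, sigma_g, rho, mu_grid, sigma_grid):
    """Steps 2 and 4: scan (mu_m, sigma_m) and keep the lowest-RMSE pair."""
    best = None
    for mu_m in mu_grid:
        for sigma_m in sigma_grid:
            pred = mean_p_node_positive(diams, mu_m, sigma_m, mu_g, sigma_g, rho)
            err = rmse(pred, observed)
            if best is None or err < best[0]:
                best = (err, mu_m, sigma_m)
    return best


# Synthetic "observations" generated from known parameters, then recovered.
diams = [10.0, 20.0, 40.0]
observed = mean_p_node_positive(diams, -9.0, 1.0, -2.9, 0.71, 0.5)
print(fit_metastasis(diams, observed, -2.9, 0.71, 0.5,
                     [-10.0, -9.0, -8.0], [0.5, 1.0, 1.5]))
```

Reusing the same seed for every candidate \(({\mu }_{m}, {\sigma }_{m})\) keeps the RMSE surface free of sampling noise between candidates, which makes a coarse scan like this one easier to interpret.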

For a given value of \(\rho\), we were able to find an optimized pair \(({\mu }_{m}, {\sigma }_{m})\), which yields a result quantitatively similar to the dashed curve of Fig. 8. Table 2 lists these optimized metastasis parameters for each of the two growth models for a few different values of the correlation coefficient \(\rho\). Table 2 shows that as \(\rho\) increases the standard deviation \({\sigma }_{m}\) increases, while the mean \({\mu }_{m}\) remains unchanged. However, the metastasis parameter values are sensitive to the growth model and parameters, which is not unexpected, given noticeable differences between the growth parameters of refs. [22, 23]. Had a tumor growth model calibrated on the METABRIC population itself been available, it would have been the most appropriate one to use for this dataset.

Finally, we would like to clarify that metastasis rates discussed here pertain only to spread from the primary tumor to the lymph nodes. There are additional processes whereby metastatic spread can occur from the primary tumor and the lymph nodes to distant organs. Extracting such rates from survival data would require complex biological models of direct and indirect spread and assumptions relating a critical level of metastasis to subsequent organ failure and death.

Summary

The aim of this work was to develop a model for predicting survival probability as a function of continuous time and tumor diameter for different cofactors such as node status, patient age group, and tumor grade. In order to overcome data sparsity within small diameter ranges in a moderately sized dataset like the METABRIC dataset, we adopted the strategy of representing the tumor diameter distribution among patients with a Dagum probability distribution [8], and then optimizing the model parameters to best match the corresponding KM mean survival curve. By analyzing the METABRIC dataset [9] we observe that the Cox proportional hazard model is not applicable for the cofactors of interest, i.e., the hazard ratio between different node statuses, age groups, or tumor grade levels does not remain constant as a function of time. Our model survival function intrinsically takes this into account by incorporating cofactor-dependent exponents (Eq. (2)), along with marginalization over diameter distribution (Eq. (4)). These parameters can be readily used to estimate the survival probability of a patient for any specified length of time beyond surgery. Such an approach was found to have accurate predictive power for mean survival probability for different cofactor combinations and was able to flexibly reproduce unexpected features in the data, e.g., the reversal of survival probabilities between age groups 1 and 2 as a function of time.

Finally, by studying the fraction of node-positive patients as a function of tumor diameter, we show how to decipher metastasis rates from the primary tumor surface to the lymph nodes (prior to tumor removal via surgery). More specifically, assuming known models for tumor growth rate \(\kappa\) from the literature [22, 23], assuming a metastasis rate \(m\) proportional to the tumor surface area, and assuming a bivariate lognormal distribution of \((m,\kappa )\), we determine the marginal patient-to-patient distribution of \(m\) (see Table 2).

Knowledge of the mean survival probability (along with uncertainty bounds) as a function of the primary tumor’s size, grade, node status, and patient age can be of immense help in managing the patient’s quality of life beyond surgery. Such knowledge can aid the doctor’s recommendation and the patient’s choice of postoperative treatment options. Additionally, knowing mean metastasis rates (and their variability) could enable doctors to make more informed assessments of the progression of the disease, especially in cases where the metastatic tumors are too small to be detectable by current clinical means.