Introduction

Clinical observations of HCC across the world suggest differences in causes, such as HBV, HCV, alcoholism or aflatoxin B1 exposure [1], as well as locations [2]. However, a common major difficulty in the assessment of the natural history in many studies is the arbitrary time at which diagnosis is made in each patient. This includes early identification due to screening, or to diagnosis later in the disease progression consequent on complications from tumor growth, or from hepatic failure. Many scoring and classification systems have taken the twin issues of cirrhosis and tumor extent into account, including those of Okuda, CLIP, BCLC, CUPI, SLiDE, JIS, and Tokyo scores, amongst others [311] and several papers have recently compared multiple staging systems [1223]. They have been mainly intended to aid in patient selection for surgery, loco-regional therapy such as RFA or TACE, or palliative care. However, in the US, where there is no national screening program, the majority of HCC patients present at an advanced phase of their disease, either due to the severity of the underlying cirrhosis or the extent of their HCC. There has not so far, to our knowledge, been a large prospective study of these advanced unresectable HCC patients; although, several small series having evaluated prognostic factors in these patients [8, 20, 2429].

The difficulty of analysis in HCC is compounded by the fact that tumor growth can cause a variety of complications due to direct tumor extension and a reduction in residual hepatic function can result from either the associated cirrhosis or from the tumor extension, so that assumptions based on independent variables such as conventional regression analysis may not be valid. An alternative strategy to address the problem of variation in the time of diagnosis is to use a subset analysis strategy. In this approach, the major variables relating to the stage in the natural history are taken into consideration to identify more homogenous and comparable populations at the time of clinical presentation. The impact of the variable of interest on clinical outcome is then compared in each patient subset. This approach requires a substantial sample size and uniformity in both data collection as well as treatment approach.

The recent availability of information obtained at baseline assessment with prolonged follow up of clinical outcomes from an exceptionally large case cohort of patients provides an opportunity to use this strategy. All the patients were referred to a single physician (BIC) at one institution over 17 years for medical management of unresectable HCC. We have addressed the hypothesis that HCC represents heterogeneous disease groups.

If valid, the clinical implication for this hypothesis is that patients with identifiable profiles or phenotypes at disease presentation could be anticipated to have differences in prognosis. We have also derived a potentially useful clinical scoring system from a purely data-derived approach that was harmonized with the physician-driven intuitive strategy, to predict the prognosis of an individual case and contrast the utility of the two approaches.

Methods

Clinical methods

This prospective 17-year experience was acquired in HCC patients being seen by a single liver oncologist. 967 patients with HCC that were considered unsuitable for resection or liver transplantation were managed with TACE. On initial clinical evaluation, all patients had baseline CBC, liver function tests, alpha-fetoprotein (AFP), hepatitis serology, serum creatinine, physical examination, liver and tumor biopsy, and a triphasic helical computed axial tomography (CT) scan of the chest, abdomen and pelvis. Liver and tumor biopsies were obtained. The data and CT descriptors were entered into an HCC database intended for follow-up and analysis. All patients were followed for the time of their survival either through this clinic, or through close liaison through their primary care physician.

Statistical methods

Overall survival was the only end point used, defined as the time between the date of diagnosis and date of death. In order to devise a potentially useful clinical scoring system from this large dataset that could be applied to the prognosis of an individual patient, we undertook a detailed inspection of the plots of survival versus the laboratory values of bilirubin, alkaline phosphatase, SGOT, and GGTP. As the relationship of each parameter levels to survival had a non-linear correlation, we partitioned the data of each parameter into multiple range sets to better capture this reality in the scoring. We converted the recorded lab values to the same relative scale (0–100%) for each parameter, to permit comparison of value histograms. We found the overall distributions of all these parameter value frequencies were the sum of five components, expressed as five Gaussian distribution curves. Each of these ranges was characterized by the most frequently observed value (maxima of components). We calculated the average characteristic concentration value for each of the five Gaussian distribution ranges and converted them into relative numbers (input) with respect to the maximum for each parameter. The clinical descriptive parameters (cirrhosis, ascites, PVT, logAFP) were converted into a simple numerical scale and combined with the transformed characteristic laboratory variables input. A set of this combined data for each individual patient’s unique descriptors was used to characterize each patient and to derive appropriate weighting of each variable in the scoring system. Logistic regression analysis was then performed to predict a short- or long-survival category for each patient. To identify the best threshold between short- and long-survival, we repeated the regression analyses for several thresholds between 30 and 180 days and found the optimal prediction was for a threshold of 90 days. In the final step, we derived clinically usable scores by multiplying the characteristic inputs by their regression coefficients from the logistic regression model, to identify a variable-specific contribution to identify a single variable (sum) that if negative (<0), indicates a survival longer than 90 days; and if positive (>0), indicates a survival shorter than 90 days. This scoring system is entirely data-driven, rather than hypothesis-driven.

Results

Characterization of the heterogeneity of the patient data

This is data from a practice, not a clinical trial. In the optimal design of a gold-standard clinical trial, the study design minimizes the impact of many complex aspects of patient disease, while these aspects have clear impact on the information in the practice-based clinical data. This feature of the clinical practice databases requires using special methods for their analysis. In this article, we show that in the well-curated and ideally recorded data from normal clinical practice for cohorts of thousands of patients, the information about many aspects of the disease/liver cancer/HCC heterogeneity are indeed present and can be extracted.

In our approach to extracting the clinically relevant information for patients with HCC, we took advantage of our understanding of the important factors that characterize the uniqueness of information in the clinical practice data. Figure 1 summarizes schematically our understanding as interplay of the four factors: data heterogeneity, stochasticity, ambiguity, and coherence.

Fig. 1
figure 1

The four general factors describing the interplay of information categories in a clinical practice database

These terms summarize in a compact way the following features of the information in the clinical practice data: 1. stochastic factor is defined by the uncertainty of the actual disease status of patients that come for treatment in random stages of the disease. 2. Factor of heterogeneity (the one we analyze in more detail in this article) represents the possibility of distinct underlying biology or pathophysiology that might be grouped together under the umbrella of a common diagnosis. 3. Factor of ambiguity represents the possibility that the same values of one clinical parameter can have different significance for patients with different clinical or pathophysiological status. 4. Factor of coherence represents the common sense clinician’s understanding that if certain combination of one series of clinical parameters is found significantly more often (=coherently) than other combinations, then such an observation might have profound clinical relevance. Figure 1 emphasizes the fact that these four factors cannot be treated independently, because the actual status of one influences the clinical impact of all the others. Thus, different combinations of these four factors provide different angles through which we can see, process, and analyze informative structures of the raw clinical data (Fig. 1). When used in combination to design the mathematics of the data analysis, optimal extraction of the relevant information from normal clinical data can be achieved.

In this article, we approached the analysis through the perspective of a combination of heterogeneity and coherence of the information in the HCC dataset. We first asked whether there was evidence of heterogeneity of the disease in the HCC dataset. To approach this, we took advantage of the large number of patients in this database, which facilitates our construction of histograms of the clinical parameter values that represented the probability distributions of these values. We next examined whether the distributions of the values of these parameters were multi-component in a way that will provide significant information about subsets of patients who carry the respective HCC disease subtypes.

Mathematical approach to the heterogeneity and coherence

We initially computed the histograms of the frequencies of levels of our selected clinical parameters. Inspection of these histograms (Fig. 2) showed that they are indeed multi-component. Next, we statistically tested the ‘fit’ of all histograms by a single Gaussian component. In all cases, the fit failed the normality and constant variance tests (these tests indicated with P < 0.001 that hypothesized that the parameter value histograms are single Gaussian distributions is false). We also recorded the regression coefficients from these single-component fits, to compare them to the regression coefficients from the multi-component fits described below. In all cases, the multi-component fit resulted in significant improvement of the regression (P < 0.01) and resulted also in passing the constant variance tests (which mean that the difference between the reconstructed and the actual histogram does not contain systematic deviations from zero).

Fig. 2
figure 2

Histogram depicting the heterogeneity in the HCC clinical data

The key issue for a mathematical approach to the quantitative description of the multiple components in the HCC clinical data histograms is to select the optimal number of components that combine into the overall observed distribution. We used an iterative procedure, in which we applied non-linear least square fitting of the actual histogram by the sum of the variable number of normal distributions that sum into the actual histogram of parameter levels for each parameter. We were seeking the optimal number of sub-distributions, which we define as the minimal number of the component normal distributions that is needed to fit the shapes of the histograms of all parameters. From the clinician’s perspective, this feature of our mathematical algorithm represents the quantification of the common sense clinician’s understanding that the disease is heterogeneous. This heterogeneity is more evident if it is found in all the parameters that are being studied and if mathematical treatment of each parameter is coherent with the treatment of all other parameters.

Figures 2 and 3 show the optimal result by repeating the fitting of all parameter histograms, we found that we could describe all these histograms with only five-component normal distributions. This indicates that there is not a limitless heterogeneity of HCC clinical data, but rather that the data for each parameter can be described in terms of only five groups or components. Thus, the data show us that we can treat the heterogeneity of information in HCC databases by dividing the data into six intervals, which describe internally homogeneous patient subsets.

Fig. 3
figure 3

Relationship of range 1–6 of the serum-based clinical parameters as derived by our data-driven definition of threshold values (see Table 1). Thus, index 1 for bilirubin represents values above 9.5, index 2 bilirubin values 4.5–9.5, index 3 range 2.0–4.4, index 4 range 1.0–1.9, index 5 range 0.5–0.9, and index 6 range 0–0.4 mg/dl. For ALKP, index 1 corresponds to range above 900, index 2 to range 435–900 U/100 ml etc. also for other parameters as shown in Table 1. The lines show that relationship between these range characteristics and median survival of all patients with parameter values in the respective ranges can be described as exponential, with statistical significance (R 2 > 9.4 and P < 0.0001)

This required a definition of five threshold values for each parameter. We indicate by the dotted arrows in Fig. 2 how we used the five-component decompositions of parameter histograms for this purpose. Each of the five-parameter ranges shown in Table 1 were derived from the points at which the component Gaussian distributions crossed over each other in Fig. 2. Our results indicate that while in the clinical practice data the different subgroups cannot be completely separated, our data-driven approach to describe these five thresholds of each parameter is the best choice that maximizes the separation of the coherent levels of parameter ranges. It is also clear that any other arbitrary way of choosing clinical ranges (for example dividing the parameter values into equal intervals) would result in suboptimal result, where the data from different categories are more mixed when compared to our optimal selection, which reflect best the actual structure of the heterogeneity of HCC data. This is the unique feature of our clinical data that we were able to quantitatively capture by our analysis method, which in turn facilitates our treatment of the data and its application to clinical prognosis. Thus, Table 1 ranges were determined as is shown by arrows in Fig. 2 for each parameter (ALKP, AFP, GGTP, SGOT).

Table 1 Data-driven scoring system for survival prognosis

Validation through reconstruction of short- and long-survival categories

While the coherence factor of a common number of components in the distributions of clinical parameter data helps to extract the relevant information about the heterogeneity of the ranges of these clinical parameters, we still needed to validate the clinical relevance of this heterogeneity. First, we show in Table 1 that there is a strong correlation between the ranges 1–6 of levels of all relevant clinical parameters and median survival, as calculated for patients with the levels of individual parameters within each interval 1–6.

We used this as indication that it is possible to use the individual laboratory result to assign any patient to one of the six categories and then to develop the classification scheme allowing to characterize the patient as short- or long-surviving. To this purpose, we used the logistic regression algorithm with variable selection to determine the optimal combination of the serum-based and pathophysiological parameters. Moreover, we again adopted the iterative optimization approach to derive data-driven rather than arbitrary definition of the short- and long-survival. As is described in “Methods”, we obtained the best classification performance of our logistic regression algorithm for setting threshold between the short- and long-survival at 90 days. With this threshold, we were able to develop a logistic discrimination model that in 10-fold cross-validation classified correctly 85% cases (ROC area 0.79). For convenience and application in clinical practice, we converted the optimized result into the simple prognostic scheme, which is described in the next paragraph.

A data-driven analysis approach to survival prediction

In order to complement the clinically intuitive sub grouping strategy above, with a purely data-driven strategy, we used a stochastic approach (described in the “Methods”) to identify six characteristic ranges of serum bilirubin, alkaline phosphatase, SGOT, and GGTP in combination with the presence or absence of cirrhosis, ascites, PVT, or abnormal AFP to derive scoring values of each variable in a scheme to predict long- or short-outcome of survival. Optimization procedures identified 90 days of survival as providing the greatest discrimination. Using dichotomization at this time point, scores for each variable are presented in Table 1. We illustrate contrasting applications of this model, in two different patients. At baseline, patient A had no PVT (code 0), with cirrhosis (code 1) with ascites (code 1), baseline AFP of 537 ng/ml, bilirubin 2.0 mg/dl, alkaline phosphatase 155 U/100 ml, SGOT 168 U/100 ml, and GGTP 121 U/100 ml. Using Table 1, the scoring for this patient is PVT score 0, cirrhosis score −40, ascites score 75, AFP score 56, bilirubin score −235, alkaline phosphatase score −115, SGOT score −35, and GGTP score 15. The HCC score calculation for this patient is 290 + 0 − 40 + 75 + 56 – 235 – 115 − 35 + 15 = 11. As this result is positive, the scoring predicts class “A”, survival, with prognosis shorter than 90 days. In contrast, at baseline evaluation, patient B had PVT only on the left side (code 1), with cirrhosis (code 1), without ascites (code 0), baseline levels of AFP 536 ng/ml, bilirubin 0.3 mg/dl, alkaline phosphatase 263 U/100 ml, SGOT 83 U/100 ml, and GGTP 98 U/100 ml. Applying this information to Table 1, the PVT score is −15, cirrhosis score −40, ascites score 0, AFP score 55, bilirubin score −460, alkaline phosphatase score −90, SGOT score −50, and GGTP score 15. The HCC score calculation for this patient is 290 − 15 − 40 + 0 + 55 − 460 − 90 − 50 + 15 = −295. As this result is negative, the scoring predicts class “B”, survival prognosis of longer than 90 days.

In a 10-fold cross-validation, this model correctly predicted survival in 80% of cases with a true positive rate of 94% for long survival and 36% for short survival. To verify the statistical significance of our model, we also performed a 10,000-fold permutation of the short- and long-survival prognoses and tested the likelihood of obtaining the result of the model purely by chance. In none of these multiple random permutation tests, were results comparable (equal) or better than the model obtained either for the short- or for the long-survival class. Therefore, the significance of our results in terms of P value is better than 0.0001 and that this quantitative data-driven scoring system has high degree of predictability.

For direct comparison of two approaches, namely a clinical intuition-driven when compared with a data-driven partitioning of patients into clinically relevant groups, we calculated median survival and 95% confidence interval for each sub-group of patients with their parameters within the ranges shown in Table 1. We found that value 1.2 × [(Average of all means in Table 1) − 115] correlates best with the actual survival time (data not shown). For our two example patients these prognostic calculations are as follows:

$$ {\text{Patient A prognosis}} = 1.2 \times [\left( {1/8.0} \right)*(313 + 193 + 117 + 148 + 116 + 212 + 122 + 272) - 115] = 86\,{\text{days}} . $$
$$ {\text{Patient B prognosis}} = 1.2 \times [\left( {1/8.0} \right)*(162 + 193 + 341 + 203 + 447 + 140 + 213 + 272) - 115] = 158\,{\text{days}} $$

This is in reasonable agreement with the predicted survival of 90–180 days and 150–270 days from Fig. 1.

Discussion

The present series is the largest single site report of consecutive patients with unresectable HCC in the U.S.A lengthy follow up of time from diagnosis to death in this unique cohort provides insight into the pathophysiology of HCC progression, and suggests that there is heterogeneity of disease states that are included within this diagnosis to influence rate of disease progression. Overall, once HCC is not respectable or transplantable, the prognosis at time of clinical evaluation is limited (9 ± months, median ± 95% CI). However, within this cohort, there are features that determine whether this prognosis is restricted to terminal patient care (<3 months), or those whose life can be expected to extend for longer. We suggest that this discrimination is of clinical value in selecting between therapeutic choices of terminal palliative care, or an intervention designed to both ameliorate symptoms and possibly prolong survival.

HCC is an unusual cancer in that its progression can be detrimental to the patient due to two closely intertwined but discrete processes, namely tumor growth and decreased residual hepatic function, each of which can be evaluated from a variety of indices and each of which can independently be the cause of death. This was recognized in one of the earliest classification systems [3]. It is also evident that the time of diagnosis within the disease progression in any individual patient is arbitrary. Thus, an overall analysis of individual factors influencing survival is complex and confounded by interdependency between variables. The size of this patient cohort has permitted an alternative to the usual analysis strategy, by creating mutually exclusive categorizations for each patient into subsets based on CT scan and biochemical variables obtained at the time of diagnosis and comparing times of survival between subsets. We have followed a progressive strategy. We have taken variables selected to represent characteristics of the tumor (number, size, evidence of extension as judged by portal vein thrombosis) and residual hepatic function (liver function tests). We infer that HCC in the absence of cirrhosis represents a different pathophysiological entity with different rate of tumor progression than HCC in the presence of cirrhosis.

Using a similar strategy and the same logic, we also suggest that the presence of AFP above the normal level for our lab, is a similar negative prognostic indicator of survival in each of the patient subgroups, when defined by liver function and tumor extension, as previously described [30, 31]. Thus, a similar inference can be made, namely that AFP positive and AFP negative HCC represent discrete phenotypes, with different rates of tumor progression. Most classification systems have also recognized these same variables as being predictors of long-term outcome. This analysis differs, in that it is confined only to patients who were considered untreatable and deemed to be unresectable and untransplantable (or untreatable by any local ablative technique RFA, PEI), as well as the large size of this cohort study in the US. It also attempts to deal with the multiple factors involved (such as cirrhosis, ascites, liver function tests), which not being independent variables, require a slightly different approach to analysis.

Several new HCC tumor markers have recently been clinically evaluated and found to be useful in HCC diagnosis, especially AFP-L3 and DCP [29, 3236] as well as possibly hsp27, osteopontin, and glypican 3 [3741]. The rapidly evolving fields of genomics and proteomics are now finding HCC-associated gene products that will likely find future usefulness in further defining HCC prognostic subsets [4244], including metastasis potential [4547].

Clinical anecdotal case reporting is subject to a variety of biases. The problems that are inherent in cohort analysis are present here, but have been minimized to the extent possible. At the outset of creating a specialized clinical HCC program, a prospective study design was developed, of creating a phenotypic description of each patient with common historical, clinical, CT scan imaging, tumor marker, and biochemical measures. Each consecutive patient referred to a single clinician has been included, and most importantly, follow up to death from cancer is almost complete, due to a prospective and involved commitment of the physician and nurses to patient care and follow-up. This has been sustained over the 17 years of the clinical service. The regimen of cisplatin-TACE was used throughout and the patient cohort was homogeneous in that none were subjected to subsequent surgery. This follow-up information provides the very hard data end point of death.

There are, however limitations. The major bias arises from the success in the clinic in attracting patients from a wide geographic location based on physician referral advice. The treatments of cisplatin-TACE or transhepatic arterial chemo-embolization which were offered to these patients have only a small proven benefit on survival [47]. This is also a strength of this study, as TACE has then become a surrogate for studying contemporary natural history. Thirdly, quantitative approaches to understanding the pathophysiology of HCC growth and loss of residual liver function are in their infancy, and the measures used are known to be insensitive. Fourthly, any inferences drawn from this study have to be limited to the US population of HCC patients.

We have used a novel data-driven strategy that takes away arbitrary dichotomized decisions in analysis, to create a multivariate analysis in which each of the variables was normalized and contrasted against the others, to provide relative scoring. This was then used to dichotomize survival and identify longer or shorter survival relative to 90 days. We were surprised at the concordance of the continuous variable datasets to each identify six sub-populations and impressed by the ability and statistical significance to predict the outcome. We therefore consider that the relatively simple procedure presented in Table 1 augments other standard simpler classification systems. Thus, clinical selection for alternative therapeutic guidelines can be aided by a combination of two simple numerical algorithms.

Whether this extensive experience is considered from the perspective of a clinical algorithm of a tree-based subgroup analysis as is usual, or from a data-driven multivariate analysis (Table 1), it is apparent that the major predictor of a poor outcome is the serum bilirubin at the time of diagnosis. This observation implies that maintenance of residual liver function is more important than features attributed to tumor growth and extension. If anything, a previous subgroup analysis of the present series which estimated that 40% of HCC patients died from liver failure rather than from tumor progression [29], was an under-estimate. This reinforces the dual complexity of changes in liver function and HCC tumor growth.

In summary, by the time HCC has progressed to be not amenable to resection or liver transplantation, the prognosis is limited. There are, however, discrete characteristics that can be shown to influence survival. Of these, the three most dominant are PVT, elevated AFP, and loss of residual hepatic function (elevated bilirubin levels). Although several classification systems exist, these three factors predominate in prognostic significance for unresectable HCC. It is also reassuring that clinical intuition based on only three easily remembered variables is equivalent, from the perspective of clinical decision-making, to a rigorous, purely data-derived strategy of variable weighting. We also suggest that a bilirubin level >1.5 mg/dl and/or presence of PVT and elevated AFP should be taken into account in future trials of innovative therapies, to provide a strategy for stratification.