Introduction

The current standard therapy for chronic hepatitis C is 48 weeks of pegylated interferon (PEG-IFN) plus ribavirin [1]. Sustained virological response (SVR), defined as negative hepatitis C virus (HCV) RNA for 24 weeks after cessation of therapy, is attained with this regimen in fewer than 50% of patients infected with genotype 1 HCV [2, 3]. Hemolytic anemia is a common side effect of ribavirin and is the major reason for dose reduction. Age, gender, baseline platelet level, baseline hemoglobin (Hb) level [4, 5], haptoglobin phenotype [6], drug dose [7], plasma concentration of ribavirin [8], apparent clearance of ribavirin (CL/F) [9], and an early decline in Hb concentration [10, 11] have been reported to contribute to ribavirin-induced anemia. Predicting the possibility of severe anemia before therapy or at the early phase of therapy can help modify ribavirin dosage, decrease the discontinuance rate for ribavirin, and raise the SVR rate.

Data mining is a method of predictive analysis that explores data to discover hidden patterns and relationships in highly complex datasets and enables the development of predictive models. Decision-tree analysis is a core component of data mining and predictive modeling [12], and it is utilized by decision makers in various business fields. Recent publications concerning decision-tree analysis in the medical field indicate its usefulness for defining prognostic factors in various diseases such as prostate cancer [13], diabetes [14], melanoma [15, 16], colorectal carcinoma [17, 18], and liver failure [19]. The results of decision-tree analysis are presented in the form of a flow chart, which is easy to use in clinical practice [20]. This analysis has also been used to predict early virological response (undetectable HCV RNA within 12 weeks of therapy) and SVR to PEG-IFN plus ribavirin combination therapy in chronic hepatitis C [21–24]. In the present study, we used decision-tree analysis to explore before- and during-treatment predictors of severe anemia during PEG-IFN alpha-2b/ribavirin combination therapy and used a prediction algorithm to try to identify chronic hepatitis C patients who are likely to develop severe anemia.

Materials and methods

Patients

This multicenter retrospective cohort study was supported by the Japanese Ministry of Health, Labour and Welfare. Data were collected from 1081 chronic hepatitis C patients who were treated with PEG-IFN alpha-2b plus ribavirin at Osaka University, Musashino Red Cross Hospital, Toranomon Hospital, Tokyo Medical and Dental University, Nagoya City University, Yamanashi University, and their related hospitals. The inclusion criteria applied in the present study were as follows: (1) infection by genotype 1b, (2) HCV RNA ≥ 100 KIU/ml by quantitative PCR (Cobas Amplicor HCV Monitor v 2.0, Roche Diagnostic Systems, CA, USA), (3) lack of co-infection with hepatitis B virus or human immunodeficiency virus, (4) lack of other causes of liver diseases such as autoimmune hepatitis and primary biliary cirrhosis, and (5) completion of at least 12 weeks of therapy. Patients received PEG-IFN alpha-2b (1.5 μg/kg) subcutaneously every week and were administered a weight-adjusted dose of ribavirin (600 mg for <60 kg, 800 mg for 60–80 kg, and 1000 mg for >80 kg). The dosage of ribavirin was reduced from 1000 to 600 mg, 800 to 600 mg, or 600 to 400 mg when the Hb concentration decreased to less than 10 g/dl, and was discontinued when the Hb concentration decreased to less than 8.5 g/dl, based on the recommendations in the package inserts. No patient received erythropoietin or blood transfusion for the treatment of anemia. Anemia with Hb < 8.5 g/dl was defined as severe anemia in this study.
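The dosing and dose-modification rules stated above can be written out as a short Python sketch. This is only an illustration of the rules as worded in the text, not software used in the study; the function and constant names are hypothetical, and the behavior for a dose not listed in the reduction steps (it is left unchanged) is an assumption.

```python
# Sketch of the ribavirin rules quoted in the text (names are hypothetical).

SEVERE_ANEMIA_HB_G_DL = 8.5   # Hb below this: severe anemia, ribavirin discontinued
DOSE_REDUCTION_HB_G_DL = 10.0  # Hb below this: ribavirin dose reduced

# Dose-reduction steps quoted in the text (current daily dose -> reduced dose, mg).
REDUCED_DOSE_MG = {1000: 600, 800: 600, 600: 400}


def initial_ribavirin_dose_mg(weight_kg: float) -> int:
    """Weight-adjusted starting dose: <60 kg, 60-80 kg, >80 kg."""
    if weight_kg < 60:
        return 600
    if weight_kg <= 80:
        return 800
    return 1000


def adjust_ribavirin_dose_mg(current_dose_mg: int, hb_g_dl: float) -> int:
    """Apply the Hb-based reduction or discontinuation rule once."""
    if hb_g_dl < SEVERE_ANEMIA_HB_G_DL:
        return 0  # discontinued
    if hb_g_dl < DOSE_REDUCTION_HB_G_DL:
        # Assumption: a dose without a listed reduction step is left unchanged.
        return REDUCED_DOSE_MG.get(current_dose_mg, current_dose_mg)
    return current_dose_mg


def is_severe_anemia(hb_g_dl: float) -> bool:
    """Study definition of severe anemia."""
    return hb_g_dl < SEVERE_ANEMIA_HB_G_DL
```

For example, `adjust_ribavirin_dose_mg(initial_ribavirin_dose_mg(85), 9.8)` follows the 1000 to 600 mg step described above.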

For the analysis, patients were randomly assigned to either the model-building (n = 691) group or the internal validation (n = 390) group. Consent was obtained from each patient. The study protocol conformed to the ethical guidelines of the Declaration of Helsinki and was approved by the institutional review committee. The baseline characteristics and representative laboratory test results are listed in Table 1. There were no significant differences between the clinical backgrounds of the two groups.

Table 1 Comparison of clinical parameters of model-building and internal validation groups

Laboratory tests

Blood samples were obtained before therapy and at least once every month during therapy, and were used for hematological tests, blood chemistry analyses, and determination of HCV RNA. Pretreatment levels of HCV RNA were quantified by Cobas Amplicor (Roche Diagnostic Systems, CA, USA).

Database of variables and decision-tree analysis

A database of pretreatment variables was created containing 3 variables from hematological tests (Hb, white blood cells, and platelets), 11 variables from blood biochemical tests (creatinine, albumin, aspartate aminotransferase, alanine aminotransferase, gamma-glutamyl transpeptidase, total cholesterol, HDL cholesterol, LDL cholesterol, triglyceride, fasting blood glucose, and alpha-fetoprotein), creatinine clearance (Ccr), serum level of HCV RNA, liver histology (activity, fibrosis), 3 variables from patient characteristics (age, gender, and body mass index), 2 variables from therapeutic factors (PEG-IFN alpha-2b dosage, ribavirin dosage), and the level of decline of Hb concentration (at the end of 1, 2, 4, and 8 weeks from the start of treatment). Ccr levels were calculated using the Cockcroft–Gault formula [25]. Variables with data deficiency of greater than 15% were not included in the decision-tree analysis. Data deficiency was 21% in liver histology (activity, fibrosis) and 16% in the level of decline of Hb concentration at the end of 1 week. Accordingly, these variables were excluded from the database.
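For illustration, the two computational steps in this paragraph can be sketched in Python: the Cockcroft–Gault estimate of Ccr (in its standard form, with serum creatinine in mg/dl and a factor of 0.85 for women) and the exclusion of variables with more than 15% missing data. The function names and the dictionary layout are hypothetical and are not taken from the study.

```python
# Sketch only: helper names and input layout are assumptions.

def cockcroft_gault_ccr(age_years: float, weight_kg: float,
                        serum_creatinine_mg_dl: float, female: bool) -> float:
    """Estimated creatinine clearance (ml/min) by the Cockcroft-Gault formula."""
    ccr = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    return ccr * 0.85 if female else ccr


def variables_to_keep(missing_fraction: dict, threshold: float = 0.15) -> list:
    """Keep variables whose fraction of missing values does not exceed the threshold."""
    return [name for name, frac in missing_fraction.items() if frac <= threshold]


# The two deficiency rates reported in the text both exceed 15%,
# so both variables are dropped.
reported = {"liver_histology": 0.21, "hb_decline_week1": 0.16}
print(variables_to_keep(reported))
```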

On the basis of this database, we implemented the recursive partitioning analysis algorithm referred to as the decision-tree analysis algorithm [26] to define subgroups of patients with respect to the possibility of severe anemia. The data-mining software used was IBM SPSS Modeler 13 (IBM SPSS Inc, Chicago, IL, USA), as reported previously [21–24]. In brief, the software searched the patient population for the most significant variables and cutoffs to be used for dividing the total population into 2 subgroups, having different probabilities of severe anemia. Thereafter, the analysis was repeated on all subgroups in the same manner until either no additional significant variable was detected or the sample size was less than 20.
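The general idea of recursive partitioning described above can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm implemented in IBM SPSS Modeler: it assumes a Pearson chi-squared test on each candidate binary split, a fixed significance threshold without any correction for multiple comparisons, and a minimum of 20 patients in each resulting subgroup. All names are hypothetical.

```python
# Simplified recursive partitioning sketch (not the SPSS Modeler algorithm).

CHI2_CRITICAL_P05_DF1 = 3.841  # chi-squared critical value, 1 d.f., P = 0.05


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def best_split(rows, variables, min_size=20):
    """Find the (variable, cutoff, statistic) with the largest chi-squared value.

    rows: list of dicts holding numeric predictors and a boolean 'event'.
    Returns None when no split leaves at least min_size rows on each side.
    """
    best = None
    for var in variables:
        for cutoff in sorted({r[var] for r in rows}):
            low = [r for r in rows if r[var] < cutoff]
            high = [r for r in rows if r[var] >= cutoff]
            if len(low) < min_size or len(high) < min_size:
                continue
            low_events = sum(r["event"] for r in low)
            high_events = sum(r["event"] for r in high)
            stat = chi2_2x2(low_events, len(low) - low_events,
                            high_events, len(high) - high_events)
            if best is None or stat > best[2]:
                best = (var, cutoff, stat)
    return best


def grow_tree(rows, variables, min_size=20):
    """Split recursively until no significant split remains."""
    split = best_split(rows, variables, min_size)
    if split is None or split[2] < CHI2_CRITICAL_P05_DF1:
        events = sum(r["event"] for r in rows)
        return {"n": len(rows), "event_rate": events / len(rows)}
    var, cutoff, _ = split
    return {
        "variable": var,
        "cutoff": cutoff,
        "below": grow_tree([r for r in rows if r[var] < cutoff],
                           variables, min_size),
        "at_or_above": grow_tree([r for r in rows if r[var] >= cutoff],
                                 variables, min_size),
    }
```

Each leaf of the returned dictionary holds the subgroup size and its observed event rate, which corresponds to the terminal groups of a flow chart.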

For other statistical analyses, including multivariable analysis, IBM SPSS Statistics software v.15.0 (IBM SPSS Inc, Chicago, IL, USA) was used. Differences in proportions were tested by the chi-squared test. Differences in continuous variables were compared by Student’s t test. For univariate and multivariate analyses, logistic regression analysis was used to predict ribavirin-induced severe anemia. A value of P < 0.05 (two-tailed) was considered to indicate significance.

Results

Decision-tree analysis

Decision-tree analysis was carried out on the data of the model-building group using 27 variables, as described above. The analysis automatically selected 3 predictive variables to produce a total of 5 patient subgroups to build the decision tree (Fig. 1). Baseline Hb was selected as the first splitting variable, with an optimal cutoff of 14 g/dl. The possibility of severe anemia was 6.5% for patients with Hb levels <14 g/dl compared to 1.0% for patients with Hb levels ≥14 g/dl. Among patients with Hb ≥ 14 g/dl, the level of decline of Hb at the end of 2 weeks from the start of treatment, with an optimal cutoff of 2 g/dl, was selected as the second splitting variable. Patients with lower decline levels had a lower probability of developing severe anemia [<2 g/dl (group A) 0.4% vs. ≥2 g/dl (group B) 2.5%]. Among patients whose Hb was less than 14 g/dl, Ccr was selected as the second splitting variable, with an optimal cutoff of 80 ml/min. Patients with higher Ccr levels had a lower probability of developing severe anemia [≥80 ml/min, 2.4% vs. <80 ml/min (group C) 11.8%]. Among patients with a Ccr ≥ 80 ml/min, the level of decline of Hb at the end of 2 weeks from the start of the treatment was selected as the third splitting variable, with an optimal cutoff of 2 g/dl. Patients with lower decline levels had a lower probability of developing severe anemia [<2 g/dl (group D) 1.4% vs. ≥2 g/dl (group E) 11.5%].

Fig. 1 Decision-tree analysis. Boxes indicate the splitting factors and the cutoff value for the split. Pie charts indicate the rate of severe anemia (Hb < 8.5 g/dl) for each group. Terminal groups classified by the analysis were labeled from A to E. Hb hemoglobin, Ccr creatinine clearance

The probabilities of severe anemia for the 5 subgroups derived by this process were highly variable. The subgroup of patients with higher Hb levels (≥14 g/dl) (groups A and B) had a low probability of developing severe anemia (0.4–2.5%). Also, the subgroup of patients with lower Hb (<14 g/dl) but with a higher Ccr (≥80 ml/min) and lower Hb decline levels at the end of 2 weeks from the start of the treatment (<2 g/dl) (group D) showed a low probability of developing severe anemia (1.4%). On the other hand, the subgroup of patients with lower Hb (<14 g/dl) and lower Ccr (<80 ml/min) (group C) levels showed the highest probability of severe anemia (11.8%). Also, the subgroup of patients with lower Hb levels (<14 g/dl), higher Ccr (≥80 ml/min), and higher Hb decline levels at the end of 2 weeks from the start of treatment (≥2 g/dl) (group E) showed a high probability of developing severe anemia (11.5%).
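The flow chart described above can be expressed as a small Python function. This is a sketch for illustration only: the function and variable names are hypothetical, while the cutoffs and the model-building rates are those reported in the text.

```python
# Rates of severe anemia (%) in the model-building group, as reported in the text.
MODEL_BUILDING_RATE_PCT = {"A": 0.4, "B": 2.5, "C": 11.8, "D": 1.4, "E": 11.5}


def anemia_risk_group(baseline_hb_g_dl: float, ccr_ml_min: float,
                      hb_decline_week2_g_dl: float) -> str:
    """Allocate a patient to terminal group A-E of the decision tree."""
    large_decline = hb_decline_week2_g_dl >= 2.0
    if baseline_hb_g_dl >= 14.0:
        # Baseline Hb >= 14 g/dl: split on Hb decline at week 2.
        return "B" if large_decline else "A"
    if ccr_ml_min < 80.0:
        # Baseline Hb < 14 g/dl and Ccr < 80 ml/min: terminal group C.
        return "C"
    # Baseline Hb < 14 g/dl and Ccr >= 80 ml/min: split on Hb decline at week 2.
    return "E" if large_decline else "D"


group = anemia_risk_group(13.2, 95.0, 2.4)
print(group, MODEL_BUILDING_RATE_PCT[group])
```

Note that group C is assigned without reference to the week 2 Hb decline, so patients with baseline Hb < 14 g/dl and Ccr < 80 ml/min can be classified before treatment starts.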

Validation of the decision tree

The results of the decision tree were validated with the dataset of the internal validation group, which was independent of the model-building group dataset. Each patient in the validation group was allocated to groups A–E using the flow chart form of the decision tree. The rates of severe anemia (Hb < 8.5 g/dl) were 0.6% for group A, 3.0% for group B, 16.9% for group C, 2.3% for group D, and 11.0% for group E. The rates of severe anemia for each subgroup of patients were closely correlated between the model-building group and the internal validation group (r² = 0.96) (Fig. 2).
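A subgroup-level comparison of this kind can be sketched in Python as follows. The sketch assumes an unweighted Pearson correlation across the five subgroups and uses the rounded percentages quoted in the text, so its result need not match the reported value exactly; the function name is hypothetical.

```python
from math import sqrt

# Rates of severe anemia (%) per subgroup, as quoted in the text.
model_building = {"A": 0.4, "B": 2.5, "C": 11.8, "D": 1.4, "E": 11.5}
validation = {"A": 0.6, "B": 3.0, "C": 16.9, "D": 2.3, "E": 11.0}


def pearson_r(xs, ys):
    """Unweighted Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


groups = sorted(model_building)
r = pearson_r([model_building[g] for g in groups],
              [validation[g] for g in groups])
print(f"r = {r:.2f}, r squared = {r * r:.2f}")
```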

Fig. 2 Validation of decision-tree analysis with the internal validation dataset: subgroup-stratified comparison of the rate of severe anemia. The rate of severe anemia in each subgroup was plotted. The X-axis represents the model-building dataset and the Y-axis represents the internal validation dataset. There was a close correlation between the model-building and internal validation datasets (r² = 0.96)

The efficiency and stability of the decision-tree model were validated using the discrimination efficiency curve (Fig. 3). The subgroups were sorted according to the order of incidence rate of severe anemia and validated using the correlation between the cumulative cases (%) and the cumulative incidence of severe anemia (%). The curve of the model-building group was located at the left upper part compared with the standard curve, indicating that the discrimination efficiency was high. Furthermore, the curve of the model-building group was extremely similar to the curve of the internal validation group, indicating that the stability was high.
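The construction of such a curve can be sketched in Python. Subgroup sizes and event counts are not given in the text, so the function below takes them as input and the example call uses made-up numbers purely to show the mechanics; the function name is hypothetical.

```python
def discrimination_curve(groups):
    """Cumulative (cases %, severe anemia %) points for a discrimination curve.

    groups: list of (n_patients, n_events) tuples, one per subgroup.
    Subgroups are sorted by descending incidence before accumulation.
    """
    ordered = sorted(groups, key=lambda g: g[1] / g[0], reverse=True)
    total_patients = sum(n for n, _ in ordered)
    total_events = sum(e for _, e in ordered)
    points = [(0.0, 0.0)]
    cum_patients = 0
    cum_events = 0
    for n, e in ordered:
        cum_patients += n
        cum_events += e
        points.append((100.0 * cum_patients / total_patients,
                       100.0 * cum_events / total_events))
    return points


# Hypothetical counts, NOT study data: two subgroups of 50 patients each.
print(discrimination_curve([(50, 1), (50, 9)]))
```

A curve that rises steeply at first and lies above the diagonal from (0, 0) to (100, 100) indicates that the highest-incidence subgroups capture a disproportionate share of the severe anemia cases.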

Fig. 3 Validation of the efficiency and stability by the discrimination efficiency curve. a Model-building group and b internal validation group. The groups were sorted in the order of incidence rate of severe anemia and validated using the correlation between cumulative cases (%) and the cumulative incidence of severe anemia (%). The X-axis represents the ratio of patients in the order of groups predicting the development of anemia and the Y-axis represents the cumulative patients suffering from severe anemia. The discrimination efficiency and stability of the curve of the model-building group were high

Factors associated with severe anemia determined by multivariate logistic regression analysis

We also explored the factors associated with severe anemia using standard statistical analysis. By univariable analysis, age, creatinine, Hb, Ccr, fibrosis stage, and decline of Hb at 2, 4, and 8 weeks from the start of treatment were found to be associated with severe anemia (Table 2), and the odds ratios for these factors were 1.06, 9.61, 0.47, 0.95, 3.14, 0.76, 0.70, and 0.68, respectively, by univariable logistic regression analysis (Table 3). By multivariable analysis, Hb, Ccr, and decline of Hb at 2 weeks from the start of treatment were found to be independently associated with severe anemia (Table 3). Fibrosis was not included in the multivariable analysis because data were not available for 238 patients. Creatinine was not included in the multivariable analysis because creatinine and Ccr were confounding factors due to their close correlation. The declines of Hb at weeks 2, 4, and 8 were also closely correlated. We selected the decline of Hb at week 2 for the multivariable analysis because variables at earlier time points may be more useful in clinical practice. As a result, decision-tree and multivariable logistic regression analyses identified the same factors for prediction of severe anemia.

Table 2 Comparison of clinical parameters of patients with and without severe anemia
Table 3 Univariable and multivariable logistic regression analysis of factors associated with severe anemia

Discussion

Hemolytic anemia, a major common side effect of ribavirin treatment, is one of the most important adverse effects of PEG-IFN and ribavirin combination treatment. Therefore, before- and during-treatment prediction of the likelihood of severe anemia can be very useful for physicians to support clinical decisions concerning the dose reduction of ribavirin. Reducing the dose of ribavirin has been shown to affect the HCV RNA negativity [27], and the discontinuation of ribavirin has been reported to lead to a marked decrease of SVR [9]. Therefore, averting ribavirin discontinuance, even if its dose must be reduced, can lead to an improvement in the SVR rate. It is important to identify patients prone to develop severe anemia leading to ribavirin dose reduction or discontinuance in the early phase of treatment.

Using decision-tree analysis, we constructed a simple model for predicting the incidence of severe anemia during therapy. The analysis highlighted 3 variables relevant to severe anemia: Hb, Ccr, and the decline of Hb concentration by 2 g/dl at the end of 2 weeks from the start of treatment. Classification based on these variables identified subgroups of genotype 1b chronic hepatitis C patients with high probabilities of developing severe anemia. The reproducibility of the model was confirmed with the internal validation dataset. An advantage of decision-tree analysis over traditional regression models is that the decision-tree model is user-intuitive and can be readily interpreted by medical professionals without the need for any specific knowledge of statistics. Patients can be allocated to specific subgroups based on a defined rate of severe anemia simply by following the flow chart format. Using this model, an estimate of the incidence of severe anemia can be obtained rapidly, which may facilitate clinical decision making for the reduction of ribavirin dosage. Thus, this model could be readily applicable to clinical practice.

According to the results of the decision tree, patients were categorized into 2 groups. The rates of severe anemia were 0.4–2.5% for the low probability group and 11.5–11.8% for the high probability group. For example, patients in the high probability group may be the most suitable candidates for dose reduction of ribavirin. Decision-tree analysis revealed that the high probability groups are patient groups with lower Hb (<14 g/dl) and lower Ccr (<80 ml/min) levels (group C) and patient groups with lower Hb (<14 g/dl), higher Ccr (≥80 ml/min), and higher Hb decline levels at 2 weeks from the start of treatment (≥2 g/dl) (group E). In particular, groups C and A were shown to be clinically significant in Fig. 3; group C includes the majority of patients suffering from severe anemia (65% in the model-building group and 67% in the internal validation group) and the very steep tilt angle of the group C slope means that group C patients have a very high probability of developing severe anemia. On the other hand, group A includes a large number of patients (40% in the model-building group and 40% in the internal validation group), and the very gentle tilt angle of the group A slope implies that group A patients have a very low probability of developing severe anemia.

Predicting the progression of anemia is necessary to decide whether medication can be continued while minimizing the disadvantages of anemia. The apparent clearance of ribavirin (CL/F), which reflects its plasma concentration at 4 weeks after the start of combination therapy, has been used as a predictive factor for developing ribavirin-induced hemolytic anemia before the start of treatment [9, 10]. However, the use of CL/F is not practical for general clinicians, because the calculation of CL/F is complicated. We revealed that a decline of Hb concentration by 2 g/dl at 2 weeks from the start of treatment (“2 by 2” standard) is both sensitive and convenient for identifying patients at high risk for severe anemia [10, 11]. The present study using decision-tree analysis revealed that Hb decline at week 2 was a significant and independent predictor of severe anemia. When considered along with other predictive factors, decision-tree analysis enables more exact identification of the patients prone to severe anemia.

Recently, a genome-wide association technique was used to show that ITPA polymorphism affects ribavirin-induced anemia. Polymorphisms (rs1127354 and rs7270101) that cause ITPase deficiency are strongly associated with protection from ribavirin-induced hemolytic anemia and with a lesser need for ribavirin dose reduction [28–30]. These polymorphisms are very valuable, but the indication for treatment is determined not by them but by viral genotypes or by IL28B variations. The present decision tree, which involves a factor attained after initiation of PEG-IFN plus ribavirin therapy, i.e., Hb decline at week 2, is useful for selecting the best regimen, and can be easily used by general clinicians.

What is unique to the present study is the visualization of the probability of severe anemia by combining factors and its high reproducibility, as revealed by high-quality validation of the internal validation dataset that was completely independent of the model-building dataset. The factors used in the decision-tree model were clinical parameters that are readily available through the usual work-up of patients. This model can be immediately applied to clinical practice without imposing any cost for additional examinations.

A potential limitation of the present study is that data-mining analysis has an intrinsic risk of showing relationships that are relevant to the original dataset but are not reproducible across different populations. Although internal validation showed that our model had high reproducibility, we recognize that further validation using a larger external validation cohort, especially in populations other than Japanese, is necessary to verify the reliability of our model.

In conclusion, we built a decision-tree model for predicting severe anemia caused by PEG-IFN alpha-2b plus ribavirin combination therapy in chronic hepatitis C with genotype 1b and high viral load. Because this decision-tree model is composed of simple variables, it can be easily applied to clinical practice. This model may have the potential to support decisions concerning ribavirin dose reduction during PEG-IFN alpha-2b plus ribavirin combination therapy and contribute to increasing the rate of SVR.