1 Introduction

In hospitals within both public and private healthcare systems, there is growing concern about the quality and sustainability of the service. Readmission events, defined as recurrent visits of a patient within a time span shorter than a given threshold, have become one of the quality measures, regarding both patient care and economic factors. In some countries, insurance companies have set a time threshold below which they decline to cover the cost of patient care, so the hospital must assume it. Therefore, the prediction and prevention of these events is becoming economically critical for some institutions. In other countries, healthcare quality is the primary concern, so preventing readmissions is a measure of improved patient care. Readmission predictors are built by machine learning techniques as specific two-class classifiers. A specific issue when building these predictors from data is that readmission events are much less frequent than normal admissions, i.e. the datasets are class imbalanced.

In supervised classification, data imbalance occurs when the a priori probabilities of the classes are significantly different, i.e. there exists a minority (positive) class that is underrepresented in the dataset in contrast to the majority (negative) class. In healthcare, as in other fields (e.g. fraud detection or fault diagnosis), instances of the minority class are outnumbered by negative instances. Moreover, the minority class is usually the target class to be predicted, because it is related to the highest cost/reward events. Most classification algorithms assume equal a priori probabilities for all classes, so when this premise is violated the resulting classifier is biased towards the majority class: it has higher predictive accuracy over the majority class, but poorer predictive accuracy over the minority class.

The degree of class imbalance is given by the imbalance ratio (IR), defined as the ratio of the number of instances in the majority class to the number of instances in the minority class; for example, a dataset with 950 negative and 50 positive instances has IR = 19. Some studies have shown that classifier performance deteriorates even with modest class imbalance in the training data [11].

Although imbalanced data classes have been recognized as one of the key problems in the field of data mining [14], the issue is not usually taken into account in the literature on readmission risk prediction, even though some authors [2] have encountered class imbalance problems when building their predictive models. Some works, such as [1, 12, 15], point out the existence of the class imbalance problem and propose methods to circumvent it; nevertheless, only simple preprocessing approaches such as oversampling and undersampling are considered. Recent works [8, 10] in the field of disease risk prediction have attacked the class imbalance problem using different preprocessing and ensemble techniques, such as SMOTE or RUSBoost, among others.

The main contributions of this paper are:

  • A proposed methodology, based on RUSBagging, for overcoming the class imbalance problem

  • An experimental study on real-world data comparing the performance of different methods

The paper is organized as follows. In Sect. 2 we present our dataset, the methodological approach followed to build our models, and the evaluation metrics. Section 3 presents the experimental results. In Sect. 4 we discuss the conclusions and future work.

2 Materials and Methods

2.1 Experimental Dataset

We used a pseudonymised dataset composed of 99858 admission records collected between January 2013 and April 2016 at the Hospital José Joaquín Aguirre of the Universidad de Chile, which is part of the public health system of Chile. The variables recorded in the dataset are divided into three main groups: (i) sociodemographic and administrative data, (ii) health status, and (iii) reasons for consultation or diagnoses made at admission. Records with missing values were discarded for this study. Table 1 shows the characteristics of the dataset and the distribution of 72-hour readmissions among the different variables.

Table 1. Characteristics of the dataset

2.2 Data Pre-processing

Data was provided as a large ASCII text file containing 156120 admission records corresponding to 102534 different patient identities. After parsing the data, we built a dataset combining admission and patient-related data. Next, we cleaned the data by removing inconsistent samples; remaining missing values were imputed using the arithmetic mean for continuous variables and the mode for categorical variables.

For each admission of a patient to the ED we calculated the number of days elapsed since their last visit. In order to build our model following a binary classification approach, the target variable was set to readmitted/not readmitted. Patients returning to the ED within 72 h after being discharged were considered readmitted; otherwise they were considered not readmitted.

Notice that a patient returning the very first day after discharge and another one returning on the third day are both considered readmitted, whereas a patient returning at the 73rd hour after discharge is considered not readmitted.
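As an illustration, this labeling step can be sketched in pandas as follows; the column names and the toy records are assumptions for the example, not the actual schema of our dataset.

```python
import pandas as pd

# Toy records standing in for the real admissions table (assumed schema).
df = pd.DataFrame({
    "patient_id":     [1, 1, 2, 2],
    "admission_time": pd.to_datetime(["2013-01-01 10:00", "2013-01-03 09:00",
                                      "2013-01-05 08:00", "2013-01-20 08:00"]),
    "discharge_time": pd.to_datetime(["2013-01-01 18:00", "2013-01-03 20:00",
                                      "2013-01-05 12:00", "2013-01-20 12:00"]),
})
df = df.sort_values(["patient_id", "admission_time"])

# Time elapsed between each admission and the same patient's previous discharge.
prev_discharge = df.groupby("patient_id")["discharge_time"].shift()
gap = df["admission_time"] - prev_discharge

# A visit is a 72-hour readmission iff it starts within 72 h of the previous
# discharge; first visits (gap is NaT) compare as False and are labeled 0.
df["readmitted"] = (gap <= pd.Timedelta(hours=72)).astype(int)
```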

2.3 Evaluation Metrics

The evaluation metrics that we have used are sensitivity, specificity, accuracy and the Area Under the ROC Curve (AUC), defined as follows (a short computational sketch is given after the list):

  • Accuracy. In binary classification, accuracy is defined as the proportion of true results among the total population:

    $$\begin{aligned} Accuracy= \frac{\varSigma TN+\varSigma TP}{\varSigma TN+\varSigma TP+\varSigma FN+\varSigma FP}, \end{aligned}$$
    (1)

    where TN is a true negative, TP a true positive, FN a false negative and FP a false positive. In heavily imbalanced datasets accuracy is not very meaningful, because a simple strategy such as assigning each test sample to the majority class already provides high accuracy.

  • Sensitivity. Sensitivity is a classification performance measure defined as the proportion of correctly classified positives:

    $$\begin{aligned} Sensitivity= \frac{ TP}{ TP+ FN}, \end{aligned}$$
    (2)

    Sensitivity is more informative about the success on the target (minority) class.

  • Specificity. Specificity is defined as the proportion of negatives that are correctly identified as such:

    $$\begin{aligned} Specificity= \frac{ TN}{ TN+ FP}, \end{aligned}$$
    (3)
  • AUC. The Area Under ROC Curve (AUC) shows the trade-off between the sensitivity or \(TP_{rate}\) and \(FP_{rate}\) (1 - specificity):

    $$\begin{aligned} AUC= \frac{1 + TP_{rate} - FP_{rate}}{2} \end{aligned}$$
    (4)

    where the True Positive rate is equal to the Sensitivity and the False Positive rate is defined as \(FP_{rate}=\frac{\varSigma FP}{\varSigma FP+\varSigma TN}\).
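For concreteness, the four metrics can be computed with scikit-learn as sketched below (function and variable names are assumed); note that with hard 0/1 predictions roc_auc_score reduces exactly to Eq. (4).

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

def report(y_true, y_pred):
    """Compute the four evaluation metrics from binary labels and predictions."""
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        # Sensitivity is recall on the positive (minority) class.
        "sensitivity": recall_score(y_true, y_pred, pos_label=1),
        # Specificity is recall on the negative (majority) class.
        "specificity": recall_score(y_true, y_pred, pos_label=0),
        # With hard 0/1 predictions this equals (1 + TP_rate - FP_rate) / 2.
        "auc":         roc_auc_score(y_true, y_pred),
    }
```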

Table 2. Confusion matrix for a binary classifier

2.4 Learning from Imbalanced Data

The main issue when learning from imbalanced datasets is that classification learning algorithms are often biased towards the majority class and, hence, there is a higher misclassification rate for minority class instances (which are usually the most interesting ones from a practical point of view). Figure 1 depicts a taxonomy of methods developed to deal with class imbalance [9], identifying three main families of techniques: preprocessing, cost-sensitive learning and ensemble techniques. We give a quick overview of each strategy below.

Fig. 1. Taxonomy of class imbalance problem addressing techniques, as proposed in [9]

Preprocessing. Methods following this strategy resample the original dataset in order to change the class distribution. Resampling techniques can be divided into three groups: (i) undersampling techniques, which delete instances of the majority class; (ii) oversampling techniques, which replicate or create new instances of the minority class, such as the Synthetic Minority Over-sampling Technique (SMOTE) [4]; and (iii) hybrid techniques, which combine both. The two basic approaches are illustrated in the sketch below.
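The following minimal sketch applies both resampling strategies using the imbalanced-learn library on synthetic data; the library and all parameter values are assumptions for the example, not part of our experimental setup.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data standing in for the admissions dataset.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))  # roughly 19:1 imbalance

# Undersampling: delete majority-class instances until classes are balanced.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Oversampling: synthesize new minority-class instances with SMOTE.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_rus), Counter(y_sm))
```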

Cost-Sensitive Learning. The strategy followed by cost-sensitive learning methods is to assign a different cost value to each type of misclassification, so that the bias towards the majority class is counterbalanced. A cost matrix is built by assigning cost values to the entries of the confusion matrix (see Table 2). The usual approach is to heavily penalize misclassifications of the minority class. These methods are categorized into the following groups (a minimal sketch follows the list):

  • Direct methods, that introduce the misclassification cost within the classification algorithm.

  • Meta-learning, where the algorithm itself is not modified. Instead, a preprocessing (or postprocessing) mechanism is introduced to handle the costs. Meta-learning methodologies can be divided into two categories, namely thresholding and sampling.
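A minimal sketch of the direct approach via class weights is given below; the cost ratio of 20 is an assumed value, in practice often tied to the IR.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the admissions dataset.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Errors on the minority class (readmissions) are penalized 20x more than
# errors on the majority class, offsetting the bias towards the majority.
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 20}, random_state=0).fit(X, y)
```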

Ensemble Classifiers. Ensemble methods rely on the idea that the combination of many “weak” classifiers can improve on the performance of a single classifier [6]. They are divided into two groups, namely cost-sensitive ensembles and data and algorithmic approaches.

  • Cost-sensitive ensemble techniques are analogous to the cost-sensitive methods mentioned earlier, although in this case the cost minimization is undertaken by the boosting algorithm.

  • Data and algorithmic approaches, which embed a data preprocessing technique in an ensemble algorithm. Depending on the ensemble algorithm they use, three groups are identified: (i) Boosting, (ii) Bagging and (iii) Hybrid.

Bagging [3] consists in creating bootstrapped replicas of the original dataset by sampling with replacement (i.e. different copies of the same instance can appear in the same bag), so that a different classifier is trained on each replica. Originally, each new dataset or bag maintained the size of the original dataset. UnderBagging and OverBagging strategies, however, embed a resampling process, so that bags are balanced by means of undersampling or oversampling techniques. To classify an unseen instance, the output predictions of the weak classifiers are collected and a majority vote is performed to produce the joint ensemble prediction. In this group we find, among others, algorithms like SMOTEBoost [5], which embeds SMOTE oversampling within boosting, or UnderBagging [13], which embeds undersampling within bagging. We propose RUSBagging, which carries out a random undersampling for each bag generated during ensemble creation; an individual weak classifier is trained from the data in each bag. A sketch of the idea is shown below.
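The following sketch illustrates the RUSBagging idea as just described; it is a simplified illustration under assumed defaults (Decision Tree base learner, 10 estimators), not our exact implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

class RUSBagging:
    """Bagging ensemble in which every bag is balanced by random undersampling."""

    def __init__(self, base_estimator=None, n_estimators=10, random_state=None):
        self.base_estimator = base_estimator or DecisionTreeClassifier()
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        X, y = np.asarray(X), np.asarray(y)
        pos = np.flatnonzero(y == 1)  # minority class (readmitted)
        neg = np.flatnonzero(y == 0)  # majority class
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # Bootstrap the minority class and draw an equally sized random
            # undersample of the majority class, so each bag is balanced.
            pos_idx = rng.choice(pos, size=len(pos), replace=True)
            neg_idx = rng.choice(neg, size=len(pos), replace=False)
            idx = np.concatenate([pos_idx, neg_idx])
            self.estimators_.append(clone(self.base_estimator).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Majority vote over the weak classifiers' predictions.
        votes = np.stack([est.predict(X) for est in self.estimators_])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```

In our experiments the base learners are the Decision Tree and Random Forest classifiers described in Sect. 3.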

3 Experimental Results

In this section we present the results obtained when predicting the 72-hour readmission risk on the dataset presented in the previous section. We have tested two data balancing methods: random undersampling (RUS) and random undersampling embedded in a bagging approach (RUSBagging). We used the following well-known classification algorithms, implemented in the open source machine learning library scikit-learn:

  1. Decision Tree (DT), with Gini impurity as the splitting criterion

  2. Random Forest (RF), with Gini impurity as the splitting criterion and 10 estimators

The models were evaluated using 10-fold cross-validation, performing 10 independent executions. Accuracy, specificity, sensitivity and AUC were calculated for each execution, and their averages and standard deviations were computed. In order to statistically compare the results we employed an Analysis of Variance (ANOVA) approach.
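A sketch of this evaluation protocol is shown below; the synthetic data and the classifier are stand-ins for our actual dataset and models. Since scikit-learn provides no built-in specificity scorer, it is derived from recall_score.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the preprocessed dataset and a classifier.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
clf = DecisionTreeClassifier(criterion="gini", random_state=0)

scoring = {
    "accuracy":    "accuracy",
    "sensitivity": "recall",                                # recall on class 1
    "specificity": make_scorer(recall_score, pos_label=0),  # recall on class 0
    "auc":         "roc_auc",
}

# 10 independent executions of 10-fold cross-validation; averages and
# standard deviations are then taken over all folds and executions.
runs = []
for seed in range(10):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    runs.append(cross_validate(clf, X, y, cv=cv, scoring=scoring))
```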

The following data balancing approaches were compared: (i) the original dataset with its imbalanced class distribution, (ii) random undersampling and (iii) RUSBagging. Table 3 shows the average accuracy, sensitivity, specificity and AUC, along with their respective standard deviations, for each method and classifier.

3.1 Comparison of Classifiers

According to the results shown in Table 3, of the two classification algorithms, Random Forest achieves significantly better results (p < 0.001) than Decision Tree in terms of AUC. Although DT performs better on the original dataset (where both classifiers perform poorly anyway), RF performs much better when the preprocessing and ensemble approaches are used. As shown in Fig. 3, the AUC is significantly greater for RF when RUSBagging is used; however, some sensitivity is sacrificed compared with DT. Overall, the results are poor, yet they compare well with the state of the art in readmission prediction: in a recent review [7], most studies reported performances measured by AUC near 0.5, with some outliers achieving a maximum of 0.7 (Fig. 2).

Table 3. Mean (±standard deviation) of performance metrics for each data balance method and classifier model configuration
Fig. 2. ROC curves for DT using undersampling, RUSBagging and the original dataset

Fig. 3. ROC curves for the DT and RF algorithms using the RUSBagging method

3.2 The Effect of Preprocessing and Ensemble Methods

Several conclusions can be extracted from the results shown in Table 3.

  • The models trained without modifying the original class distribution were clearly biased towards the majority class. Although accuracy scores were high (>90%), specificity was close to 100% while sensitivity tended to zero. Thus, according to the AUC scores, these models performed similarly to, or only slightly better than, a random classifier.

  • Using random undersampling for class balancing had a direct effect on the performance of the resulting model. Results show that both DT and RF obtain better AUC scores, 0.56 and 0.58 respectively, and sensitivity increases considerably. However, as could be expected, both accuracy and specificity decrease.

  • RUSBagging, which embeds random undersampling within a bootstrap aggregating algorithm, outperforms both previous approaches. According to the AUC scores, the combination of RUSBagging and Random Forest shows the best performance, with a mean of 0.60.

  • The AUC values of all models suggest poor discrimination ability. Nevertheless, a systematic review of risk prediction models for hospital readmission documented similar AUC scores (ranging from 0.50 to 0.70) in most of the studies [7].

4 Conclusions and Future Work

In this paper we have presented the results of readmission prediction based on a real dataset from a hospital in Santiago, Chile. To overcome the class imbalance problem we propose an approach called RUSBagging, that carries out random undersampling for each bag in a bagging ensemble training.

Results show that RUSBagging in combination with Random Forest significantly improves predictive performance in the context of a highly imbalanced dataset. Nevertheless, our model has shown limited predictive ability for clinical purposes, which seems to be related to the inherent difficulties and limitations of the readmission risk prediction problem. We have attacked one major issue (data imbalance), but others, such as the appropriate selection and measurement of variables, remain untouched in this paper. In order to validate the usefulness of the presented approach, we plan to gather and include additional baseline status and administrative data and to perform a prospective study. Future work will also include an extension of our comparative study with new methodologies and classifiers.