Introduction

In recent times, information about cancer patients has increasingly been stored in large data sources. These databases are often built for studying changes in the incidence and behavior of cancers. Among cancers, breast malignancy is the most common cancer and the second highest cause of cancer death among women. It is a major health problem and represents a significant worry for many women and their physicians [1]. When this disease is diagnosed and determined to be localized without evidence of metastasis, it is still critical to identify patients who are at a substantial risk of experiencing cancer recurrence, especially distant metastasis.

Assessment of an individual woman’s actual risk of recurrence of breast cancer is difficult. Among known risk factors are abnormal values for some morphological and pathological tumor specifications and biological tumor markers. Identification of risk factors that are associated with the recurrence of cancer makes it possible to tailor the most appropriate treatment for the individual. Patients assigned to high-risk groups get more intensive treatment and more frequent follow-ups. This assessment constitutes a very critical decision and the role of domain experts is important. However, the availability of these experienced oncologists is limited. The challenge is how to support less experienced oncologists when they need expert knowledge in order to care for their patients [2].

It would be of considerable benefit if knowledge about what to do and how to do it could be extracted from data sources. Electronic medical records and registers are data sources that can provide knowledge about how different patients have been diagnosed and treated. Knowledge discovery in databases (KDD) [3] can be used to create a representation for this knowledge. Data mining is a part of KDD that is designed to look through data in search of patterns or relationships between variables, and then to validate the findings by applying the identified models to new data [4].

Decision tree induction (DTI) is a data mining method in the form of a tree structure, and it is used to classify cases in a dataset [5]. The resulting tree is a representation that can be verified by humans and can be used by either humans or computer programs [6]. DTI has been used in different areas of medicine including oncology [7, 8] and respiratory diseases [9]. Decision trees can be easily visualized and formulated into if–then rules. DTI has been compared in several studies with other techniques such as artificial neural networks (ANN) [10–12], and it has been shown that the accuracy of the techniques is similar. However, DTI produces an understandable model that explains the reasoning of the method, in contrast to the “black box” approach in ANN. In building predictive models, there is a risk of overfitting the training data, which leads to poor accuracy in future predictions. The solution here is pruning of the tree, and the most common method is post-pruning. In this method, the tree grows from a dataset until all possible leaf nodes have been reached, and then particular subtrees are removed. Post-pruning creates smaller and more accurate trees [13].

A prerequisite for successful knowledge discovery is the availability of quality data. A dataset that is representative of a population and contains all important variables affecting a specific event is needed.

By analyzing the data stored in a regional cancer register with a data mining method, we try to find rules for detecting high-risk breast cancer patients. These patients may develop distant metastasis (invasion of other organs by malignant cells) and need special attention. A predictive model resulting from DTI can support less experienced oncologists. However, for any such use as decision support, the model needs validity, transparency and an acceptable degree of accuracy.

In this study, we first analyzed a regional cancer register by DTI in order to develop a model for predicting the occurrence of distant metastasis in breast cancer patients. Thereafter, the accuracy of the model’s predictions for 100 randomly selected cases was compared with that of predictions made by two domain experts to see if there was any significant difference between these prediction sources.

Background

Recurrence of breast cancer and distant metastasis

Recurrence of breast cancer often occurs in the first 3–5 years after diagnosis. It can come back as a local/regional recurrence or as a distant metastasis. The most common sites of recurrence include the lymph nodes, bones, liver, and lungs [14].

In loco-regional recurrences malignant cells remain in the original site in a preserved breast, in the chest wall or in regional lymph nodes, and over time grow back. This may be because of failure of the primary treatment or return of the tumor cells. Distant metastasis is the fatal type of recurrence. When cancer spreads beyond the breast, it usually reaches the axillary lymph nodes first. In 25% of distant recurrences, breast cancer spreads from the lymph nodes to bone. Other sites to which breast cancer may spread include the bone marrow, lungs, liver, brain, or other organs. Unfortunately, the chance of recovery after this recurrence is low, and death due to breast cancer is very probable following the occurrence of distant metastasis.

Predictors for high risk breast cancer

Known predictors of breast cancer recurrence include the following. The S-phase fraction is a measure of the percentage of cancer cells that are in the phase of the cell cycle during which DNA is synthesized. Some studies have shown that higher fractions are generally associated with poorer overall survival [15]. Examining lymph node involvement is essential when assessing the probability of breast cancer recurrence. The overall survival of patients has been shown to decrease as nodal involvement increases [16]. Periglandular growth of the malignant tumor [17], size of the tumor [18], and receptors for estrogen and progesterone [19] have also been found to be important predictors for recurrence of this disease. Some studies indicate that age plays a role [20], and very young patients have a poorer prognosis. Age is also important for loco-regional recurrence. Some other predictors might also be important, but they are not usually recorded in the breast cancer registers. The importance of the above-mentioned variables as predictors of breast cancer recurrence was confirmed in our previous study [21].

Cancer registers

Six regional cancer registers perform cancer registration in Sweden and one of these serves south-east Sweden, comprising the counties of Kalmar, Jönköping and Östergötland with a population of about one million. The breast cancer register for the south-east region of Sweden has the following properties that make it a useful dataset. It covers more than 95% of breast cancer patients in the region [22]. Its quality is assessed regularly and probable mistakes are checked by directly contacting physicians or pathologists. In this region there are also registers that provide data on additional risk factors and allow a better estimation of the recurrence of breast cancer, i.e. the tumor marker register and the death register. The tumor marker register includes values for some newer laboratory measurements for breast cancer such as receptors for estrogen and progesterone and S-phase fraction. The death register contains information about cause of death and can be linked to other registers by using the unique personal number.

Since there is information about tumor specification, treatment and follow-up in these registers, it is possible to find patterns describing the recurrence of breast cancer. Patients get their treatment based on the knowledge of clinicians and on protocols. The prognosis of the disease depends on the combination of each patient’s disease specification and treatment. By analyzing these data, hidden knowledge may be discovered. By representing this knowledge and making a predictive model, it is possible to predict the outcome of new patients.

Knowledge discovery in databases (KDD)

KDD is the evolving field that provides automated analysis solutions for extraction of implicit, unknown knowledge and potentially useful information from data [23]. Data mining is the pattern extraction stage of the KDD process [24]. The extracted patterns may be used for diagnosis, screening, prognosis, monitoring, therapy support or overall patient management, and these methods have been successfully used in predicting survival in breast cancer [7].

To uncover and formulate the hidden knowledge, a number of steps should be considered [3]. After understanding the domain and finding suitable sources of data, the next step is preparing these data. Cleaning data from noise and outliers and handling missing values, and then finding the right subset of data, prepares them for successful data mining. Afterwards, in the data mining step, the processed data are used to create a model that can be employed for predicting recurrence in newly diagnosed patients. Data mining has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [25] and “the science of extracting useful information from large data sets or databases” [26]. One important goal of data mining is prediction, which is the most common type of data mining with the most direct practical applications.

Materials and methods

The first phase of this study consisted of preparing data sources, linking and matching datasets, pre-processing data, data mining and building a predictive model. In the next phase, prediction accuracies for the occurrence of distant metastasis or death because of breast cancer made by human experts were compared with prediction accuracies from decision tree induction. Afterwards, ROC curve analysis was used for validation. Figures 1 and 2 schematically describe methods that were applied in this study.

Fig. 1 Steps leading to building a predictive model

Fig. 2 Comparison study

Data preparation

In order to build the best possible predictive model, variables from different sources were collected. The main dataset was the regional breast cancer register for south-east Sweden.

Data were collected from female patients, mean age 61.9 years, with the diagnosis of malignant breast cancer. The earliest patient was diagnosed in 1986 and the last one in 1995. Because the outcome for this study was distant metastasis occurring up to 4 years after diagnosis, patients who were followed up for less than this period were omitted from the study. There were 664 (18%) patients with this type of recurrence in the dataset.

If patients developed symptoms following treatment they were referred to the hospital, but otherwise follow-up visits occurred at fixed time intervals for all patients.

The methodology for preparing the data for the main analysis was the same as in our previous studies [21, 27]. This step started with selecting appropriate variables, cleaning the raw data and removing outliers by running a set of logical rules. Some examples of these outliers were negative values for the time between cancer diagnosis and recurrence and very high values for patient age at the time of diagnosis. The register was searched for multiple entries (unknown to the authors in the previous study) and repeated cases were omitted. After eliminating repeated cases, the number of cases decreased to 3,699.
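The cleaning step described above can be sketched as a small set of logical rules followed by de-duplication. This is only an illustration: the record fields, the age threshold and the choice to match repeated entries on a patient identifier are assumptions, not details reported in the study.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical upper bound for a plausible age; the study does not state its rule.
MAX_PLAUSIBLE_AGE = 110


@dataclass(frozen=True)
class Case:
    patient_id: str                        # hypothetical identifier field
    age_at_diagnosis: float                # years
    months_to_recurrence: Optional[float]  # None when no recurrence was recorded


def is_plausible(case: Case) -> bool:
    """Logical rules that flag the two kinds of outliers mentioned in the text."""
    if case.months_to_recurrence is not None and case.months_to_recurrence < 0:
        return False  # recurrence cannot precede diagnosis
    if not 0 < case.age_at_diagnosis <= MAX_PLAUSIBLE_AGE:
        return False  # implausibly high (or non-positive) age
    return True


def clean(cases: List[Case]) -> List[Case]:
    """Drop implausible records, then keep the first entry per patient."""
    seen = set()
    kept = []
    for case in cases:
        if not is_plausible(case):
            continue
        if case.patient_id in seen:
            continue  # repeated register entry
        seen.add(case.patient_id)
        kept.append(case)
    return kept
```

Whether an implausible value should lead to dropping the whole case, as here, or only to treating that value as missing is a design choice that the text leaves open.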

Subsequently, missing values for continuous variables were substituted using multiple imputation (MI) [28]. In this technique, missing values are replaced by final values resulting from repeated imputation, analysis and pooling steps. The variation among different imputations shows the uncertainty with which the missing values can be predicted from the observed ones. The result is several complete datasets. Thereafter, each of the simulated complete data sets is analyzed by standard methods, and the results are combined to produce estimates and confidence intervals that incorporate missing data uncertainty. Handling missing values by MI was done using the standalone version of NORM software written by Schafer [29]. The software starts by fitting models to incomplete data using the expectation maximization (EM) algorithm [30]. This algorithm is a parameter estimation method that falls within the general framework of maximum likelihood estimation and is an iterative optimization algorithm. Following convergence of the EM algorithm, a data augmentation (DA) procedure was implemented. DA is an iterative process that utilizes the observed data to provide estimates of both the missing data and distributional parameters. The result of handling missing values is shown in Table 1. In this table, some statistics before and after handling missing values are presented.
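The pooling step of multiple imputation can be illustrated with the combining rules commonly attributed to Rubin. This is a sketch under the assumption that these standard rules apply; the study does not spell out the formulas, and the function name is hypothetical.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_imputations(estimates: Sequence[float],
                     variances: Sequence[float]) -> Tuple[float, float]:
    """Combine one estimate (and its variance) from each imputed dataset.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)       # average sampling variance within datasets
    between = variance(estimates)  # sample variance across the m estimates
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total
```

The between-imputation term is what carries the uncertainty about the missing values into the final confidence interval: if all imputations agreed, it would vanish and the total variance would equal the within-imputation variance.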

Table 1 Characteristics of study variables

After handling missing values, some variables were dichotomized, i.e. transformed to binary variables. The rules for binarization of these variables are shown in Table 2. These rules are based on positive/negative or normal/abnormal values for those variables. An appropriate set of variables was then selected by using canonical correlation analysis (CCA) as a dimensionality reduction technique [21, 27]. In order to reduce the risk of bias in the study, 100 cases were randomly separated from the dataset for validation and the remaining 3,599 cases were analyzed with CCA and then used for data mining and model building. After analyzing the data with CCA, a clinically relevant outcome, i.e. distant metastasis or death because of breast cancer within 4 years, was associated with the predictors.
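As an illustration of dichotomization, the sketch below encodes the kind of rule described here. The 10% threshold for S-phase fraction is the one given in the legend of Fig. 4; the field names are hypothetical, and receptor status is assumed to arrive already assessed as positive or negative.

```python
def dichotomize_s_phase(fraction_percent: float) -> int:
    """S-phase fraction below 10% becomes 0, larger values become 1."""
    return 0 if fraction_percent < 10 else 1


def dichotomize_flag(is_positive: bool) -> int:
    """Positive/negative findings become 1/0."""
    return 1 if is_positive else 0


def binarize_case(case: dict) -> dict:
    """Apply the rules to one record; the keys are illustrative only."""
    return {
        "s_phase": dichotomize_s_phase(case["s_phase_percent"]),
        "estrogen_receptor": dichotomize_flag(case["er_positive"]),
        "progesterone_receptor": dichotomize_flag(case["pr_positive"]),
    }
```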

Table 2 Rules and characteristics for variables that underwent dichotomization

Data mining and predictive model building

Several data mining techniques have been examined in breast cancer studies. Predicting breast cancer survival using different data mining methods [7], and comparing the predictive accuracy of a staging system with artificial neural networks [31] are some examples. In comparison with different data mining methods, decision tree induction (DTI) performs well and the resulting predictive model is understandable. The algorithm uses information gain as a heuristic for selecting the variable that will best separate the cases into each outcome [32]. Good interpretability of acquired knowledge and fast execution make decision trees one of the most frequently used data mining techniques [33].
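The information gain heuristic mentioned above can be written in a few lines. This is a minimal sketch for categorical variables with hypothetical names; it is not the WEKA implementation, and C4.5 additionally normalizes the gain into a gain ratio.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the outcome labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())


def information_gain(values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """Reduction in outcome entropy obtained by splitting on one variable."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / n * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


# Toy data: a variable that separates the outcome perfectly, and one that does not.
outcome = [1, 1, 0, 0]
informative = [1, 1, 0, 0]
uninformative = [1, 0, 1, 0]
gain_informative = information_gain(informative, outcome)
gain_uninformative = information_gain(uninformative, outcome)
```

At each node, the variable with the largest gain is chosen for the split, so in this toy example the first variable would be preferred over the second.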

In this study, a predictive model was made by applying DTI to the prepared data. DTI was carried out using the J48 algorithm in WEKA [34]. WEKA is a set of machine learning algorithms for data mining tasks; the algorithms can either be applied directly to a dataset or called from other programs, and the application contains tools for data preparation, classification, clustering and visualization. As in our previous study, we used WEKA to mine breast cancer register data by applying the J48 algorithm [27]. In WEKA, the J48 algorithm is the equivalent of the C4.5 algorithm written by Quinlan [5]. Post-pruning based on a 10-fold cross validation was also done to trim the resulting tree [13].

For estimating the generalization error of the predictive model, the 10-fold cross validation technique was used [35]. The data (excluding the 100 cases) were divided into ten subsets of about the same size. Then the tree was trained ten times, each time leaving out one of the subsets from training. The omitted subset was used for testing and computing the error. These error estimates were used to adjust the extent of pruning the decision tree.
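The partitioning described above can be sketched as follows. The function name and the fixed random seed are assumptions made for reproducibility; training and scoring of the tree are left out.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_cases: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each case is tested exactly once."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride gives k subsets whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With the 3,599 training cases of this study, nine subsets hold 360 cases and one holds 359, and the error estimate is the average of the ten test errors.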

Validation

One hundred cases were selected by stratified random sampling after the data sources were linked and cases were matched. The ratio of outcome positive cases was the same between the whole population and the 100-case sample. This dataset was used for validating the model by comparing the predictions with those of domain experts.
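Stratified sampling by outcome can be sketched as below. The function is hypothetical, and the example assumes that the 664 outcome-positive patients reported earlier refer to the 3,699 de-duplicated cases; the exact composition of the 100-case sample is not stated in the text.

```python
import random
from typing import List, Sequence


def stratified_sample(labels: Sequence[int], sample_size: int,
                      seed: int = 0) -> List[int]:
    """Return indices of a sample whose class ratio mirrors the population."""
    rng = random.Random(seed)
    n = len(labels)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for label, members in by_class.items():
        # Proportional quota; rounding may shift the total by one in general.
        quota = round(sample_size * len(members) / n)
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen)


# Illustrative population: 664 positives among 3,699 cases (about 18%).
population = [1] * 664 + [0] * 3035
sample = stratified_sample(population, 100)
positives_in_sample = sum(population[i] for i in sample)
```

Under these assumptions the sample contains 18 positive and 82 negative cases, so the outcome ratio in the sample matches that of the whole dataset.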

The predictive model acquired from DTI was used to predict the occurrence of the outcome, and its probability for each case was recorded. The same 100 cases were given to two domain experts for the prediction of outcome (Fig. 2). The raw data for these 100 cases were presented to them, without any pre-processing, in a paper based questionnaire (Fig. 3). Information for each patient, representing age, physical examination, pathological investigation, and hormone receptor and tumor marker studies, was printed in the questionnaire. Then for each case, the oncologists were asked to place an “X” on a visual analog scale (VAS), from 0 to 100%, to indicate the probability of the occurrence of the outcome. A sample from the questionnaire is shown in Fig. 3.

Fig. 3 Questionnaire form given to domain experts

The experts were asked to complete the questionnaires in one session. The number of cases, i.e. 100, was chosen after a discussion with the oncologists regarding the length of the session and how many cases they could read and predict in one session because of their busy schedules.

The discriminating power of the predictive model was tested and compared by calculating the areas under the ROC curves (AUCs) [36–38]. In the next step, these AUCs were compared using the pair-wise comparison method to show whether the differences were significant. Furthermore, differences between the DTI algorithm and each specialist, and the 95% confidence interval (CI), were calculated.
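One way to compute an AUC from predicted probabilities uses its rank interpretation: the AUC equals the probability that a randomly chosen outcome-positive case receives a higher predicted probability than a randomly chosen outcome-negative case, with ties counted as one half. The sketch below follows that definition; it is not necessarily the procedure used by the software in this study.

```python
from typing import Sequence


def auc(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    # Assumes both classes are present; otherwise the AUC is undefined.
    return wins / (len(positives) * len(negatives))
```

The same function applies unchanged to the probabilities marked by an oncologist on the visual analog scale, which is what makes the model and the experts directly comparable.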

The Hosmer–Lemeshow goodness-of-fit test [39, 40] was applied to evaluate how closely the predicted recurrence probabilities fit the observed recurrences.
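The test statistic can be sketched as follows: cases are sorted by predicted probability, split into groups, and the observed and expected numbers of events are compared within each group. The number of groups is a parameter here because the text does not report it, and the function name is hypothetical.

```python
from typing import Sequence


def hosmer_lemeshow_statistic(probs: Sequence[float], outcomes: Sequence[int],
                              groups: int = 10) -> float:
    """Sum over groups of (observed - expected)^2 / (n * p_mean * (1 - p_mean))."""
    # Sort by predicted probability only, so outcomes never influence grouping.
    pairs = sorted(zip(probs, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        mean_p = expected / len(chunk)
        denominator = len(chunk) * mean_p * (1 - mean_p)
        if denominator > 0:
            statistic += (observed - expected) ** 2 / denominator
    # The p-value is usually read from a chi-squared distribution with
    # (groups - 2) degrees of freedom; that lookup is not done here.
    return statistic
```

A small statistic (and hence a large p-value) means that predicted probabilities and observed event rates agree within the groups.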

Results

The decision tree was trained with 3,599 cases (after the exclusion of 100 cases). The complete decision tree and some statistics are shown in Fig. 4 and Table 3. This model was then used for predicting the probability of the occurrence of the outcome of the disease in 100 cases.

Fig. 4 The resulting decision tree. LN involvement 1/0 shows if the tumor has invaded/not invaded adjacent lymph nodes. Tumor size is in millimeters and is obtained from the pathology report. If estrogen or progesterone receptor proteins are positive they are transformed to 1, and if not they are transformed to 0. If the tumor is not palpable in the physical examination then the variable N0 tumor is 1, and otherwise it is 0. Periglandular growth 1/0 indicates if the tumor has grown/not grown outside the tumor boundaries, and S-phase fractions less than 10% are transformed to 0 and larger amounts are transformed to 1. In the leaves (gray boxes), there are two numbers in parentheses. The first number shows the number of cases that reached this leaf and the second shows how many of those cases do not belong to the class assigned to the leaf, i.e. are misclassified. The number outside the parentheses indicates the class for cases that reach this leaf: 1 means cases with recurrence of the disease and 0 means absence of recurrence.

Table 3 Performance of the predictive model created from all data minus 100 cases with 10-fold cross validation

Probabilities resulting from the predictive model and from domain experts plus the real outcome for 100 cases were collected and ROC curves were drawn (Fig. 5). Areas under the ROC curve (AUC) for each method were 0.755, 0.810 and 0.847 for DTI, oncologist 1 and oncologist 2, respectively. The difference in AUCs between DTI and oncologist 1 was 0.055 (95% CI = −0.043–0.153) and the significance level for the difference was 0.27. The difference in AUCs between DTI and oncologist 2 was 0.092 (95% CI = −0.001–0.186) and the significance level for the difference was 0.053. In Table 4, a confusion matrix shows predictions done by domain experts and the decision tree and their comparison with the real values.
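The reported differences, confidence intervals and significance levels can be checked for internal consistency. The sketch below assumes a normal approximation (the interval being the difference plus or minus 1.96 standard errors, with a two-sided z-test); the text does not state how the intervals were obtained, so this is a plausibility check rather than a reproduction of the original analysis.

```python
from math import erfc, sqrt
from typing import Tuple


def z_test_from_ci(diff: float, lower: float,
                   upper: float) -> Tuple[float, float, float]:
    """Recover standard error, z and two-sided p from a 95% CI (normal approximation)."""
    standard_error = (upper - lower) / (2 * 1.96)
    z = diff / standard_error
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return standard_error, z, p_value


# Values as reported in the text.
se_1, z_1, p_1 = z_test_from_ci(0.810 - 0.755, -0.043, 0.153)  # DTI vs oncologist 1
se_2, z_2, p_2 = z_test_from_ci(0.847 - 0.755, -0.001, 0.186)  # DTI vs oncologist 2
```

Under this assumption the recovered p-values are close to the reported 0.27 and 0.053, with small deviations attributable to rounding of the published interval bounds.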

Fig. 5 A comparison between ROC curves

Table 4 Confusion matrix showing predictions done by oncologists and the decision tree in comparison with the real outcomes (no reply from oncologist 1 for one of the cases 1)

The Hosmer–Lemeshow goodness-of-fit test gave chi-squared values of 3.29, 10.47 and 27.74 and p-values of 0.19, 0.16 and 0.0005 for DTI, oncologist 1 and oncologist 2, respectively.

Discussion

Predicting the probability of occurrence of distant metastasis is a very critical task. Both false positive and false negative predictions have unwanted effects on the patient and on the health care system. Accordingly, it is important to discuss the feasibility of the methodology proposed in this study.

The scope of data mining methods

The main aim of constructing cancer registers is not data mining. The data are not gathered for this purpose and registers may not contain all the necessary information. For successful data mining, a maximum number of relevant variables should be available in addition to high quality data. An ordinary breast cancer register may not contain all of the important predictors of recurrence. With the addition of high-tech laboratory tests, the estimation of recurrence may be improved; however, the main research question addressed in this study was whether useful knowledge could be extracted from an ordinary clinical database.

Different arguments have been used concerning how to prepare the data, what method of data mining to use and how to evaluate the results. A strength of data mining methods is that they perform well when they are trained with high-quality, relevant data. This is why data preparation is an important step in knowledge discovery [41].

Decision support via the predictive model

Convincing clinicians about the usefulness of a clinical decision support model is an important task. In order to do so, we should be able to demonstrate the quality and usefulness of the model. To provide patients with quality health care, clinicians with good knowledge of the specific domain as well as extensive clinical experience are necessary. Senior oncologists use their experience and knowledge to study the risk factors of individual patients. This experience is gained after years of practice and cannot be learned through theoretical education alone. This experience-rich knowledge can be visualized, preserved and reallocated by a predictive model that is to be integrated in a decision support application for use by less experienced oncologists. This is a challenging task, and if it can be done successfully, it will help to increase the quality of health care. However, the most important issue is the attitude of clinicians toward using such a clinical decision support application. To argue that the extracted knowledge, expressed as a predictive model, works as well as experienced clinicians, we used AUC to compare the predictions.

Clinicians tend to overestimate the severity of diseases. False positive predictions are more acceptable than false negative predictions. This means that the cut-off for predictions is not the traditional 0.5, and sensitivity and specificity could be different for different diseases based on their severity. To handle this problem we used AUC, which analyzes the whole range of cut-off levels and constitutes a more general validation.
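The effect of the cut-off can be made concrete with a small sketch. The function and the toy data are illustrative only and are not taken from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(probs: Sequence[float], outcomes: Sequence[int],
                            cutoff: float) -> Tuple[float, float]:
    """Classify as positive when the probability reaches the cut-off."""
    tp = fn = fp = tn = 0
    for p, y in zip(probs, outcomes):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    # Assumes at least one positive and one negative case.
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: lowering the cut-off catches more true recurrences
# at the price of more false alarms.
toy_probs = [0.9, 0.4, 0.6, 0.2, 0.1]
toy_outcomes = [1, 1, 0, 0, 0]
at_half = sensitivity_specificity(toy_probs, toy_outcomes, 0.5)
at_low = sensitivity_specificity(toy_probs, toy_outcomes, 0.15)
```

Because each cut-off yields a different pair of values, a single pair cannot summarize a predictor; the ROC curve traces all such pairs and the AUC condenses them into one number.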

A comparison of AUCs shows that the three approaches for predicting recurrence have no significant differences in discriminating power. Calibration was assessed with the Hosmer–Lemeshow goodness-of-fit test. The test provides a p-value where higher values (p > 0.05) indicate non-significant differences between observed and predicted probabilities, implying that the estimates fit the data at an acceptable level. By this measure, the DTI model has a higher p-value and works more reliably than the oncologists in predicting the probabilities for recurrence of breast cancer. A predictive model cannot be both perfectly reliable (i.e. calibrated) and perfectly discriminatory [40]. One is increased at the expense of the other, and this may be the reason that the oncologist who had the highest AUC had the poorest calibration (lowest p-value).

Our proposed model tends to predict cases with no recurrence better (Table 3, Table 4), because our training database is dominated by “no recurrence” cases. However, since the performance of the model is comparable to the predictions made by domain experts for the 100 test cases, this constitutes an argument in favor of the model and its usability when domain experts are not available.

One way of improving the prediction accuracy for cases with positive recurrence is to use a balanced dataset. With this approach, the number of cases in both classes should be the same. However, because of the rather low number of recurrences, this might result in a small dataset. Training DTI with a small dataset may not result in a meaningful predictive model. Another way is to use sampling techniques to make balanced datasets. This approach combines over-sampling the minority (abnormal) class and under-sampling the majority (normal) class for achieving a better classifier performance [42]. However, the resulting dataset is artificially manipulated and may not be representative of the original data [43].
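A simple form of this combined resampling is sketched below. It uses plain random over- and under-sampling with a hypothetical target size per class; it is not the specific technique of reference [42], which may generate synthetic cases instead of duplicating existing ones.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def rebalance(cases: Sequence[Hashable], labels: Sequence[int],
              per_class: int, seed: int = 0) -> List[Tuple[Hashable, int]]:
    """Return (case, label) pairs with exactly `per_class` cases in each class."""
    rng = random.Random(seed)
    balanced = []
    for cls in sorted(set(labels)):
        members = [c for c, y in zip(cases, labels) if y == cls]
        if len(members) >= per_class:
            # Under-sample: draw without replacement.
            picked = rng.sample(members, per_class)
        else:
            # Over-sample: keep every case and add random duplicates.
            picked = members + rng.choices(members, k=per_class - len(members))
        balanced.extend((case, cls) for case in picked)
    rng.shuffle(balanced)
    return balanced
```

The duplicated minority cases carry no new information, which is one reason why a model trained on such data should still be validated on cases with the original class ratio.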

Using a visual analog scale (VAS) to capture judgments makes it easier for clinicians to express their predictions because they are looking at the whole risk from 0–100%, and this is easier than just writing a percentage.

Validation of the decision tree model

Physicians use all available information about their patients when deciding about the severity of their diseases. This also includes the appearance and mood of the patient and a complete physical examination at the first visit. The decision tree model, on the other hand, is dependent on the availability and quality of the data stored in the register. However, for a realistic comparison between the predictions made by domain experts and the decision tree model, the datasets should be similar. For this reason, the same 100 cases and the same variables for each case were provided to domain experts and the DTI predictive model.

The method used for random sampling is important. In this study, the method is stratified according to the outcome. The ratio of the 1/0 values for the outcome in the sample is equal to that for the whole population. This is important, because due to the random nature of the sampling it is possible to get very high or very low ratios, which can distort the results.

The greater the number of domain experts in a study, the more reliable the results will be. The participation of more experts makes it possible to examine the variability between experts. However, a study of inter-rater variability was not part of the objectives of this study, and the two experts were selected as representatives of clinical practice. The busy schedules of the oncologists and the need to complete the forms in one session made it difficult to use more than 100 cases.

Future work

It is possible to review different data mining studies concerning a specific cancer, e.g. breast cancer, and improve the register by adding more relevant predictors. This will improve the quality of the register as a source for a better training set for data mining later on.

In continuing this study, the aim will be to combine the DTI predictive model with guidelines for the treatment and management of cancer in a clinical decision support application integrated with routine daily work.

In following up this study, another step would be to use more domain experts. In order to be able to generalize the performance of our methodology, we should have more domain experts for further validation of the results.

Conclusion

Comparison of the results of human experts and those of the predictive model shows that it is possible to formulate the knowledge that is hidden in registers in the form of a decision tree. Since a DTI model is easy to understand and implement, while at the same time producing predictions whose accuracy did not differ significantly from that of domain experts, the proposed methodology can be used as a semi-automatic knowledge discovery approach for building predictive models in oncology.