Introduction

According to the World Health Statistics 2013 report, cancer is a leading cause of death [1]. In the United States, breast cancer ranks first in cancer incidence and second in cancer mortality among women [1]. In Taiwan, according to the "Statistics Report of Bureau of Health Promotion, Department of Health 2012", more than nine thousand women were diagnosed with breast cancer, and about one thousand eight hundred women died of the disease. Also in Taiwan, breast cancer ranks first in cancer incidence and fourth in cancer mortality among women. These cancer statistics reveal that breast cancer is one of the most serious threats to women's health.

Referring to Table 1, there are three common screening methods for breast cancer prediction: mammography, ultrasound, and MRI. These screening methods may reduce breast cancer mortality and increase the breast cancer survival rate. In Taiwan, the BHP (Bureau of Health Promotion) provides biennial mammography for women aged 45-69. In addition, there is evidence that early detection through mammography screening and adequate follow-up of women could significantly reduce mortality from breast cancer [2, 3]. However, these screening methods demand considerable cost, and mammography screening may not be cost-effective. Meanwhile, overdiagnosis by screening mammography has been reported [4]. Bleyer and Welch estimated that breast cancer was overdiagnosed in 1.3 million U.S. women over the past 30 years. In 2008 alone, they estimated that breast cancer was overdiagnosed in more than 70,000 women, accounting for 31 % of breast cancers diagnosed.

Table 1 Screening methods for breast cancer [5–7]

In this paper, we propose a computational model to evaluate the risk of breast cancer based only on patient questionnaire information. Prior to mammography screening, our computational method can serve as a pre-diagnosis program in a low-cost setting.

We make use of Weka (Waikato Environment for Knowledge Analysis, a collection of machine learning algorithms for data mining tasks) to build a computational prediction model for breast cancer risk assessment. Weka implements various kinds of classification methods, which we group into three categories: "basic classifier", "ensemble method", and "cost-sensitive method".

In the first category, "basic classifier", we choose J48 (Trees), LMT (Trees), NaïveBayes (Bayes), LibSVM (Functions), IBk (Lazy), and RBFNetwork (Functions), described as follows (a brief illustrative sketch follows the list).

  • J48 classifier: The J48 classifier uses the C4.5 algorithm to generate a decision tree for prediction. Based on the concept of information entropy, a tree-based model is constructed; the easily interpreted model can reach a reasonable precision.

  • LMT (logistic model tree) classifier: The LMT classifier is a classification model that combines decision tree induction with logistic regression learning.

  • Naïve Bayes (NB) classifier: The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem. It is usually robust to isolated noise points and irrelevant attributes, following the statistical principle of combining prior knowledge of the classes with evidence gathered from data.

  • LibSVM (support vector machines, SVM) classifier: The LibSVM classifier constructs a hyperplane to separate the different classes of data, maximizing the margin around the separating hyperplane.

  • IBk classifier: The IBk classifier is the k nearest-neighbor algorithm, a type of lazy, instance-based learning. The IBk classifier has the advantage of constructing arbitrarily shaped decision boundaries, and it is also applicable to data with high-variance distributions.

  • RBFNetwork (RBF) classifier: The RBFNetwork classifier is an instance-based learning method which implements a normalized Gaussian radial basis function network for prediction.
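
For illustration, here is a minimal sketch using scikit-learn stand-ins for these basic classifiers. This is an assumption for readability, not the Weka implementations used in our experiments, and the synthetic data set is a hypothetical stand-in for the questionnaire attributes.

```python
# Sketch with scikit-learn analogues of the basic classifiers; the synthetic
# data set is a hypothetical stand-in for the questionnaire attributes.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB          # Naive Bayes analogue
from sklearn.neighbors import KNeighborsClassifier  # IBk (k-NN) analogue
from sklearn.svm import SVC                         # SVM (LibSVM-backed)
from sklearn.tree import DecisionTreeClassifier     # decision tree (J48-like)

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

for clf in [DecisionTreeClassifier(), GaussianNB(), SVC(), KNeighborsClassifier()]:
    clf.fit(X, y)                                   # train on the sketch data
    print(type(clf).__name__, clf.score(X, y))      # training accuracy only
```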

In the second category, "ensemble method", we choose VOTE (Meta), AdaBoostM1 (Meta), Bagging (Meta), Stacking (Meta), and RandomForest (Trees), described as follows (a sketch follows the list). An "ensemble method" makes use of multiple "basic classifiers" to obtain better predictive performance than could be obtained from any of the constituent classifiers. In other words, an ensemble is a technique for combining many weak classifiers in an attempt to produce a strong classifier.

  • VOTE classifier: The VOTE classifier is a common theoretical framework for combining classifiers that use distinct pattern representations to accomplish a compound classification, where all the pattern representations are used jointly to make a decision [8].

  • AdaBoostM1 classifier: In the beginning, the AdaBoostM1 classifier assigns a weight to each training instance. Then, AdaBoostM1 works by repeatedly running a given weak learning algorithm on various distributions over the training data, and then combining the classifiers produced by the weak learner into a single composite classifier. The output of the weak learners is combined into a weighted sum that represents the final output of the boosted classifier.

    It reduces the bias of the weak learner by forcing the weak learner to concentrate on different parts of the instance space, and it also reduces the variance of the weak learner by averaging several hypotheses that were generated from different subsamples of the training set.

  • Bagging classifier: The Bagging classifier is a special case of the model averaging approach. It randomly creates subsets of the original data, then aggregates each subset's predictions into a final prediction.

  • Stacking classifier: The Stacking classifier combines models in a different way: it works by deducing the biases of the generalizers with respect to a provided learning set [9].

  • Random Forest (RF) classifier: The random forest classifier combines multiple decision trees. Each tree makes an independent prediction, and the class receiving the largest number of votes becomes the final result.
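
A hedged sketch of the five ensemble methods with scikit-learn analogues follows; again, these are assumed stand-ins rather than the Weka implementations, running on hypothetical data.

```python
# Scikit-learn analogues of the ensemble methods on hypothetical data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

base = [("tree", DecisionTreeClassifier()), ("nb", GaussianNB())]
ensembles = [
    AdaBoostClassifier(),       # boosting: re-weights hard instances
    BaggingClassifier(),        # bagging: bootstrap subsets, then aggregate
    VotingClassifier(base),     # majority vote over distinct classifiers
    StackingClassifier(base),   # a meta-learner combines base predictions
    RandomForestClassifier(),   # many de-correlated trees, majority vote
]
for clf in ensembles:
    clf.fit(X, y)
```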

Regarding the third category, "cost-sensitive methods" [10, 11], cost-sensitive classification aims to minimize the expected misclassification cost on a class-imbalanced data set. When applying a cost-sensitive method, we have to preset the cost matrix shown in Table 2. The cost-sensitive classifier attempts to predict the class with the minimum misclassification cost by re-weighting training instances according to the cost matrix assigned to each class. In our study, we set a higher cost for the false negative (FN) case, since an FN misjudgment is serious and could delay medical treatment for possible patients.

Table 2 The cost matrix for cost-sensitive methods
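
A rough analogue of this idea is sketched below. This is an assumption for illustration, not Weka's CostSensitiveClassifier: the higher FN cost is encoded as a larger weight on the positive class during training, and the FN_COST value is hypothetical.

```python
# Cost-sensitive re-weighting sketch: the positive ("high risk") class is
# weighted by the hypothetical FN cost, so misses become expensive.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# imbalanced synthetic data: roughly 3 % positives, mimicking screening data
X, y = make_classification(n_samples=3000, n_features=12,
                           weights=[0.97], random_state=0)

FN_COST = 50  # hypothetical penalty for missing a high-risk patient (FP cost = 1)
clf = RandomForestClassifier(class_weight={0: 1, 1: FN_COST}, random_state=0)
clf.fit(X, y)
```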

Because of the class-imbalanced nature of the data source, in this project we first apply sampling approaches to rebalance the data of interest and to improve the detection of rare cases.

In addition, since the penalties of the four decision outcomes (TP, FP, FN, and TN) are unequal, we would like to build a computational model that predicts high-risk patients with almost 100 % recall and reasonably high precision. An alternative metric, which will be detailed in section "Materials and methods", is also introduced to describe the performance of the classification models.

Materials and methods

In this section, we will first introduce our computing platform and describe our dataset. Then, we present the approach, as well as the performance measure.

The BIRADS data were collected at Taipei City Hospital from 2008/01/01 to 2008/12/31. We enrolled women who participated in a breast cancer screening program with sonography in Taipei City. Data were collected by having each participant fill out a questionnaire before examination. The sonography assessment category of the "Breast Imaging-Reporting and Data System" is used to determine the target attribute as high risk or low risk. There are 3,976 records in our BIRADS data set, of which only 94 records are positive (high risk). Since the data set contains missing values, we exclude the incomplete records, leaving 3,035 records [12]. In reference to Table 3, there are thirteen attributes, and the "High risk" attribute is the target to be predicted. The characteristics of the BIRADS data set are illustrated in Fig. 1.

Table 3 The attributes of BIRADS data set
Fig. 1

The boxplot of numeric attributes in BIRADS data set (The ‘NO.i’ denotes the i-th attribute in Table 3)

To estimate the prediction performance of the computational model, we use m-fold cross-validation, in which the original data set is randomly divided into m equal-sized partitions. Of the m partitions, a single partition is retained as the validation data for testing the model, and the remaining (m − 1) partitions are used as the training data. The cross-validation process is repeated m times so that each partition serves as the validation data exactly once, and we report the average results.
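
For concreteness, a minimal sketch of the procedure with m = 10 follows; the synthetic data set is a hypothetical stand-in for ours.

```python
# m-fold cross-validation sketch (m = 10) on hypothetical data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)  # one score per fold
print(scores.mean())  # the reported average over the m folds
```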

In reference to Fig. 2, we introduce the approach of our study and illustrate the process flow of performing experiments.

Fig. 2

Process flow of performing experiments

Given the BIRADS data set, in the preprocessing step, the records containing missing values are excluded first. Then all twelve attributes are normalized and replaced by their z-scores.
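
A minimal z-score sketch follows; the random matrix is only a placeholder for the twelve attributes.

```python
# Z-score normalization: shift each attribute to zero mean and unit variance.
import numpy as np

X = np.random.rand(100, 12)          # placeholder for the twelve attributes
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_z = (X - mu) / sigma               # z-scores replace the raw attribute values
```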

In the next step, we apply a stratified splitting procedure in which 67 % of the records form the training set, containing both "high risk" and "low risk" records; the remaining 33 % of the records form the testing set, again containing both classes.
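
A minimal sketch of such a split, on hypothetical data: stratify=y preserves the class proportions in both subsets.

```python
# Stratified 67/33 split sketch on hypothetical imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=12,
                           weights=[0.97], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          stratify=y, random_state=0)
```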

Since the data set has imbalanced classes, we have only a limited number of "high risk" records (76 among 2,959). We therefore apply sampling techniques prior to model construction. In our study, we apply the under-sampling technique and the over-sampling technique to the training data, respectively.

In the under-sampling technique, the majority class is shrunk. In the over-sampling technique, the minority class is expanded. After sampling, the sizes of the positive and negative classes are comparable, and the class boundary can become clearer.
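
One simple way to realize random under-sampling is sketched below; this is an illustration only, not necessarily the exact procedure used in our experiments.

```python
# Random under-sampling sketch: keep all minority records and an equal-sized
# random subset of the majority records.
import numpy as np

def undersample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)                       # minority ("high risk")
    neg = np.flatnonzero(y == 0)                       # majority ("low risk")
    keep = rng.choice(neg, size=pos.size, replace=False)
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]
```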

In our study, we also make use of a dimension reduction technique to further reduce the data size while preserving the data characteristics. We adopt LSA for dimension reduction [13, 14].

LSA (Latent Semantic Analysis) is a document processing technique that originated in the field of information retrieval. It applies a series of operations from linear algebra, known as matrix decomposition, to construct a low-rank approximation of the term-document matrix. Typical applications of low-rank approximation are to index, retrieve, and cluster documents.

Given an m by n term-document matrix A, an SVD (singular value decomposition) of A can be written as

$$ \mathbf{A}=\sum\limits_{i=1}^{rank(\mathbf{A})} \sigma_{i} \mathbf{u}_{i} \mathbf{v}_{i }^{T } =\mathbf{U} {\Sigma} \mathbf{V}^{T} $$
(1)

where \(\sigma_{i}\) is the \(i\)-th singular value of A.

Usually, we choose the first k singular values to obtain the rank k approximation, as follows.

$$ \mathbf{A}^{\prime}_{k}=\mathbf{U}^{\prime}_{k}{\Sigma}^{\prime}_{k}\mathbf{V}^{\prime^{T}}_{k} $$
(2)

The low-rank approximation matrix \(\mathbf{A}^{\prime}_{k}\) yields a new representation for each document in the collection, which is expected to combine and merge the dimensions associated with terms that have similar meanings.

As a result, the original records, represented as vectors in the twelve-dimensional space, can be reduced to the corresponding vectors in the k-dimensional space. The k dimensional axes are also considered as the k "concepts".

Note that if we apply LSA on the training set, we have to apply the folding-in process on the testing set to cast its records into the low-rank representation for further processing, so that the dimensionality of the testing set matches that of the training set.

For each record \(\vec {q}\) in the testing set, the folding-in process is

$$ \vec{q}_{k} = {\Sigma}_{k}^{-1}\mathbf{U}_{k }^{T } \vec{q} $$
(3)
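
A compact numpy sketch of Eqs. 1–3 follows; the 12 × 200 matrix is a hypothetical stand-in (twelve attributes by two hundred records), and the layout is an assumption.

```python
# Truncated SVD (Eqs. 1-2) and folding-in of a test record (Eq. 3).
import numpy as np

A = np.random.rand(12, 200)              # attributes x records (assumed layout)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 7
U_k, s_k = U[:, :k], s[:k]
A_k = U_k @ np.diag(s_k) @ Vt[:k, :]     # rank-k approximation (Eq. 2)

q = np.random.rand(12)                   # a record from the testing set
q_k = np.diag(1.0 / s_k) @ (U_k.T @ q)   # folding-in projection (Eq. 3)
```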

Regarding the model construction for classification, in our study, we apply “basic classifier”, “ensemble method” and “cost-sensitive method” in search of the best practice of risk assessment. The three kinds of approaches have been described in section “Introduction”.

In the last step of validation, the classification performance is measured by the common metrics presented as follows.

Data with imbalanced class distributions are common in real-world and medical applications. The accuracy measure alone is not suitable for evaluating classification models derived from an imbalanced data set. In this study, the baseline accuracy of classification could be as high as 97.64 %. Without careful consideration, naively applying existing classification algorithms may not effectively detect instances of the rare class; that is, all instances are predicted to be low risk. As a result, the derived models become useless even though the accuracy is high.

In this subsection, we introduce some performance metrics, including accuracy, recall, precision, and ROC. In reference to the confusion matrix shown in Table 4, these metrics are convenient ways of comparing classifiers and are defined as follows.

Table 4 The confusion matrix

Accuracy

A correct classification means the predicted class matches the actual class of the test data. The accuracy of a prediction system is the degree of closeness between the predicted class and the actual class.

$$ Accuracy=\frac{TP+TN}{TP+FN+FP+TN} $$
(4)

Recall and precision

Recall is the fraction of relevant instances that are retrieved, while precision is the fraction of retrieved instances that are relevant. Precision can be thought of as a measure of exactness, whereas recall is a measure of completeness.

$$ Recall=\frac{TP}{TP+FN} $$
(5)
$$ Precision=\frac{TP}{TP+FP} $$
(6)

ROC

The ROC (Receiver Operating Characteristics) of a classifier shows its performance as a relative trade-off between sensitivity (true positive rate) and specificity (one minus the false positive rate).

$$ TPR=\frac{T P}{TP+FN} $$
(7)
$$ FPR=\frac{FP}{FP+TN} $$
(8)

The TPR (true positive rate), which can be interpreted as benefits, defines how many correct positive results occur among all positive samples available during the test. On the other hand, FPR (false positive rate), which can be interpreted as cost, defines how many incorrect positive results occur among all negative samples available during the test.
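
For concreteness, the metrics of Eqs. 4–8 can be computed directly from confusion-matrix counts; the counts below are hypothetical.

```python
# Eqs. 4-8 from (hypothetical) confusion-matrix counts.
TP, FN, FP, TN = 25, 0, 800, 2100

accuracy  = (TP + TN) / (TP + FN + FP + TN)   # Eq. 4
recall    = TP / (TP + FN)                    # Eq. 5; also the TPR (Eq. 7)
precision = TP / (TP + FP)                    # Eq. 6
fpr       = FP / (FP + TN)                    # Eq. 8; equals 1 - specificity
```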

As shown in Fig. 3, a ROC space is defined by FPR and TPR as the x axis and the y axis, respectively. A single point (i.e., the pair of values illustrated in the ROC space) indicates a prediction result of a classifier.

Fig. 3

An illustration of ROC space [15]

The best possible prediction method would yield a point in the upper left corner, i.e., coordinate (0, 1), of the ROC space, representing 100 % sensitivity (no false negatives) and 100 % specificity (no false positives). The (0, 1) point is also called a perfect classification. A classifier making completely random guesses would give a point lying along the diagonal line from the bottom-left corner (0, 0) to the top-right corner (1, 1).

In addition, the area under the ROC curve, denoted AUC, is a single index for measuring the performance of a classifier. The larger the AUC, the better the overall performance of the classifier.
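
For example, a minimal AUC sketch (the labels and scores are hypothetical):

```python
# AUC from true labels and classifier scores.
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))   # 0.75
```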

Results

Some key findings are summarized as follows.

Sampling technique (over-sampling vs. under-sampling)

In reference to Tables 5 and 6, the average recall of classifiers using under-sampling is 54.3 %, while the average recall using over-sampling is 46.8 %. In general, we conclude that the performance of under-sampling is better than that of over-sampling. Note that, without sampling techniques, predictions tend toward the majority class, and high-accuracy classifiers nonetheless become meaningless.

Table 5 Performance vs. sampling technique (under-sampling; training data size: five percent of input data set)
Table 6 Performance vs. sampling technique (over-sampling; training data size: ten percent of input data set)

Ensemble method

We conclude that, in general, the performance of ensemble methods is better than that of basic (single) classifiers. For some ensemble classifiers, such as Stacking, the high recall meets the goal of this study (see Table 9). The performance is summarized in Tables 7, 8, 9 and 10.

Table 7 Performance vs. ensemble method (AdaBoost with various ‘base’ classifiers; training data size: five percent of input data set)
Table 8 Performance vs. ensemble method (Bagging with various ‘base’ classifiers; training data size: five percent of input data set)
Table 9 Performance vs. ensemble method (Stacking with various ‘base’ classifiers; training data size: five percent of input data set)
Table 10 Performance vs. ensemble method (VOTE with various ‘base’ classifiers; training data size: five percent of input data set)

Cost-sensitive classifier

A higher FN cost implies a higher penalty for making wrong decisions. The performance of cost-sensitive methods for various cost assignments is illustrated in Fig. 4. With respect to various base classifiers, the performance is summarized in Table 11. When applying the cost-sensitive classifier with various base classifiers, the recall almost reaches 100 %. In our study, the cost-sensitive classifier with RF is the best setting.

Fig. 4

Cost-sensitive method (RF classifier) using different cost settings for the false negative case

Table 11 Performance vs. cost-sensitive methods (various ‘base’ classifiers; training data size: five percent of input data set)

Dimension reduction (LSA)

With respect to the BIRADS data set of twelve attributes, we apply LSA to reduce the dimensionality of the data set.

As shown in Fig. 5, the best k-value for the SVD is seven, which indicates that the original data set in the twelve-dimensional space can be reduced to the seven-dimensional space. Meanwhile, the performance, in terms of accuracy, recall/precision, and AUC, remains almost the same (Fig. 6).

Fig. 5

The effectiveness of applying LSA on BIRADS for various k values

Fig. 6

Visualization of the BIRADS data after applying LSA; only the first three concepts (i.e., \(y_{1}\), \(y_{2}\), \(y_{3}\)) are shown; k = 7. The 'circle' denotes "high risk"; the 'plus' denotes "low risk"

The energy concentration ratio measures how concentrated the information of the data set is within the leading coefficients. The energy E(\(\vec {x}\)) of a vector \(\vec {x}\) in n-dimensional space is defined as the sum of the energies at every point of the vector:

$$ E(\vec{x}) = {\sum\limits_{i=1}^{n} |x_{i}|^{2}} $$
(9)

where \(\vec{x} = (x_{1}, x_{2}, \ldots, x_{n})\).

Given an original vector \(\vec{x} = (x_{1}, x_{2}, \ldots, x_{12})\) and the transformed vector \(\vec{y} = (y_{1}, y_{2}, \ldots, y_{12})\), the energy concentration ratio of k (denoted ratio-of-k) measures the fraction of energy captured by the k strongest coefficients of the transformed vector.

$$ \textit{ratio-of-k} = \dfrac {{\sum}_{i=1}^{k} y_{i}^{2} }{ {\sum}_{i=1}^{12} y_{i}^{2} },~1 \leq k \leq 12. $$
(10)

The ratio is between 0 and 1. A higher ratio-of-k indicates that the first k coefficients suffice to represent the whole vector, with a smaller squared error (the sum of squares of the omitted coefficients) between the reduced vector and the original vector.
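
A small sketch of Eq. 10 follows, assuming (as the SVD ordering provides) that the coefficients are sorted from strongest to weakest; the vector is hypothetical.

```python
# ratio-of-k (Eq. 10): fraction of energy in the first k coefficients.
import numpy as np

def ratio_of_k(y_vec, k):
    energy = np.square(y_vec)
    return energy[:k].sum() / energy.sum()

y_vec = np.array([3.0, 2.0, 1.0, .5, .2, .1, .1, .05, .02, .01, .01, .005])
print(ratio_of_k(y_vec, 7))  # close to 1: the first seven coefficients dominate
```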

By applying LSA, the seven (k = 7) "concepts" (denoted by \(y_{1}\), \(y_{2}\), \(y_{3}\), \(y_{4}\), \(y_{5}\), \(y_{6}\), and \(y_{7}\)) are described as follows.

For better illustration, these seven derived "concepts" (i.e., (11)–(17)) are visualized in Figs. 7 and 8.

Fig. 7

An illustration of three "concepts" derived by LSA (denoted by \(y_{1}\), \(y_{2}\) and \(y_{3}\), left to right)

Fig. 8

An illustration of four "concepts" derived by LSA (denoted by \(y_{4}\), \(y_{5}\), \(y_{6}\) and \(y_{7}\), left to right, top to bottom)

For example, considering the '1-st concept' (in reference to (11) and Fig. 7), the concept is dominated by the first, second, third, and seventh attributes.

$$\begin{array}{@{}rcl@{}} y_{1}&=&(-0.2385)x_{1}+(-0.9019)x_{2}+(-0.3234)x_{3}+(-0.0002)x_{4} \\ && +(-0.0002)x_{5}+(-0.0002)x_{6}+(-0.1577)x_{7}+(-0.0015)x_{8} \\ &&+(-0.0014)x_{9}+(-0.0005)x_{10}+(-0.0099)x_{11}+(-0.0046)x_{12} \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} y_{2}&=&(0.7487)x_{1}+(-0.2770)x_{2}+(-0.0718)x_{3}+(-0.0004)x_{4} \\ && +(0.0009)x_{5}+(0.0010)x_{6}+(0.5961)x_{7}+(0.0036)x_{8} \\ && +(0.0028)x_{9}+(0.0105)x_{10}+(0.0449)x_{11}+(0.0015)x_{12} \end{array} $$
(12)
$$\begin{array}{@{}rcl@{}} y_{3}&=&(0.0303)x_{1}+(0.3300)x_{2}+(-0.9434)x_{3}+(0.0000)x_{4} \\ && +(-0.0006)x_{5}+(0.0002)x_{6}+(0.0026)x_{7}+(-0.0046)x_{8} \\ && +(0.0015)x_{9}+(-0.0002)x_{10}+(-0.0104)x_{11}+(-0.0030)x_{12} \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} y_{4}&=&(0.6068)x_{1}+(-0.0274)x_{2}+(0.0067)x_{3}+(-0.0009)x_{4} \\ && +(-0.0005)x_{5}+(-0.0005)x_{6}+(-0.7814)x_{7}+(0.0192)x_{8} \\ && +(0.0082)x_{9}+(0.1076)x_{10}+(0.0688)x_{11}+(0.0603)x_{12} \end{array} $$
(14)
$$\begin{array}{@{}rcl@{}} y_{5}&=&(-0.0823)x_{1}+(0.0082)x_{2}+(-0.0111)x_{3}+(-0.0042)x_{4} \\ &&+(-0.0022)x_{5}+(-0.0038)x_{6}+(0.0330)x_{7}+(0.0041)x_{8} \\ && +(-0.0353)x_{9}+(-0.0376)x_{10}+(0.9684)x_{11}+(0.2269)x_{12} \end{array} $$
(15)
$$\begin{array}{@{}rcl@{}} y_{6}&=&(-0.0119)x_{1}+(-0.0001)x_{2}+(0.0008)x_{3}+(-0.0037)x_{4} \\ &&+(-0.0010)x_{5}+(-0.0005)x_{6}+(0.0064)x_{7}+(0.0128)x_{8} \\ && +(0.9980)x_{9}+(0.0257)x_{10}+(0.0439)x_{11}+(-0.0337)x_{12} \end{array} $$
(16)
$$\begin{array}{@{}rcl@{}} y_{7}&=&(0.0153)x_{1}+(0.0035)x_{2}+(0.0019)x_{3}+(-0.0006)x_{4} \\ && +(0.0025)x_{5}+(-0.0009)x_{6}+(-0.0334)x_{7}+(0.0499)x_{8} \\ && +(-0.0447)x_{9}+(0.0625)x_{10}+(0.2299)x_{11}+(-0.9682)x_{12} \end{array} $$
(17)

In summary, we illustrate the overall performance in the ROC space and P-R diagram (precision-recall diagram) as follows.

  1. Performance in the ROC space: see Figs. 9, 10, and 11.

  2. Performance in the P-R diagram: see Figs. 12, 13, and 14.

Fig. 9

An illustration of overall performance in the ROC space (under-sampling; basic classifiers)

Fig. 10

An illustration of overall performance in the ROC space (under-sampling; ensemble classifiers)

Fig. 11

An illustration of overall performance in the ROC space (under-sampling; cost-sensitive classifiers)

Fig. 12

An illustration of overall performance in the P-R diagram (under-sampling; basic classifiers)

Fig. 13

An illustration of overall performance in the P-R diagram (under-sampling; ensemble classifiers)

Fig. 14

An illustration of overall performance in the P-R diagram (under-sampling; cost-sensitive classifiers)

Discussion

In this section, we present the discussion based on the experiment results.

  1. In our study, we apply the cost-sensitive method to construct a computational model which meets our goal of high recall and reasonable precision.

  2. The higher the cost setting of the FN case (indicating the penalty of misclassification), the better we are able to approach our goal of "no false dismissals".

  3. As shown in Fig. 15, there is a trade-off between recall and specificity. When the recall is 100 %, the specificity is 14.87 %. If we slightly decrease the FN cost, the recall drops to 86 % and the specificity rises to 34.84 %.

  4. When building classification models on imbalanced data, the sampling technique is crucial for reinforcing the class boundary.

  5. In our study, we also apply the dimension reduction technique (LSA) to reduce the size of the data set for model construction. To be more specific, the reduced data set is about 58 % of the original data set. Meanwhile, the performance remains almost the same.

Fig. 15

Cost-sensitive methods with random forest classifiers (recall vs. specificity)

Conclusion

In this paper, we make use of patient health information to build a computational model for predicting the risk of breast cancer. Our goal is to construct a low-cost pre-diagnosis program which guarantees "no false dismissals" (i.e., a 100 % recall/sensitivity). The system architecture consists of four major components: the preprocessing module, the sampling module, the dimension reduction module, and the classifiers. Based on our performance evaluation, we conclude that the best practice is to (1) apply the under-sampling technique, (2) apply LSA dimension reduction, and (3) choose the cost-sensitive method with random forest as the base classifier. Our approach achieves a recall/sensitivity of 100 %; the corresponding precision and specificity are 2.9 % and 14.87 %, respectively. As a result, prior to mammography screening and early diagnosis, our model could be applied to predict the risk of breast cancer in clinical settings.