1 Research problem

Cervical cancer remains one of the leading causes of cancer-related death among women globally [1, 2]. Although both incidence and mortality have been decreasing in recent years, cervical cancer is still the second most common cancer among women. In Taiwan, cervical carcinoma is the second most common malignancy in women, accounting for about a quarter of all female cancer cases, and it remains one of the most pressing medical problems for women. The natural history of cervical carcinoma begins with a normal epithelium that progresses through various stages of dysplasia, namely cervical intraepithelial neoplasia (CIN) grades 1, 2 and 3, and finally to invasive cervical carcinoma (ICC). Progression to ICC takes a long time, and there is consensus that regression can occur in CIN. The most important part of therapy is therefore to detect and eradicate local CIN 3 lesions before progression to ICC and metastasis can occur [3].

The cure rate of cervical cancer is quite high when the disease is detected early, but approximately 30 % of International Federation of Gynecology and Obstetrics (FIGO) stage IB2 to stage IV disease will ultimately recur despite modern multimodality treatment [4, 5]. Once primary treatment has failed, the chance of a secondary cure is slim. Patients with recurrent disease or pelvic metastases have a poor prognosis, with a 1-year survival rate between 15 and 20 % [6]. The treatment of recurrent cervical carcinoma therefore remains a clinical challenge. The existing literature on recurrent cervical cancer identifies the following factors: (1) Age, (2) Cell Type, (3) Tumor Grade, (4) Tumor Size, (5) Pathologic T, (6) Pathologic Stage, (7) Surgical Margin Involvement, (8) Lymph Node Metastases (LNM), (9) Number of Fractions of Other RT, (10) RT Target Summary, (11) Sequence of Locoregional Therapy and Systemic Therapy and (12) Lympho-Vascular Space Involvement (LVSI). Within these broad categories, many studies have attempted to identify the plethora of factors that could enhance clinical management in the treatment of cervical cancer [7, 8]. Machine learning, which involves the retrieval and analysis of large amounts of data from a data warehouse, has been successfully used to uncover hidden patterns (or rules) in data in a variety of fields. In these circumstances, an alternative solution is to use modern machine learning techniques to identify key associations among factors relevant to cervical cancer [9, 10]. In this regard, Louie et al. [11] and Kim et al. [12] used epidemiological analysis to explore critical factors related to patient survival. However, given that most prediction techniques involve combinations of many potential causative factors, it is unlikely that any single factor can fully explain the prognosis of serious recurrence in cervical cancer. Against this background, the present study attempts to improve post-treatment surveillance, on the premise that earlier detection of relapse and precise assessment of recurrence status could improve outcomes.

Section 2 reviews the methods employed in this study. Section 3 presents an empirical application to the cervical cancer data set. Finally, Section 4 concludes the study.

2 Method

In the health field, machine learning applications have grown considerably, as they can be used to derive patterns directly from data that are relevant for identifying different risk groups among patients [13, 14]. To the best of our knowledge, machine learning techniques such as classification have not been used to analyze recurrent cervical cancer. Hence, in this paper, we attempt to identify patterns in a database of cervical cancer patients using the following advanced techniques.

2.1 SVM

The basic idea of applying SVM to classification can be stated briefly as follows. The input vectors are first mapped into a feature space (possibly of higher dimension), either linearly or nonlinearly, according to the choice of kernel function [15]. Then, within this feature space, an optimal linear separation is sought, that is, a hyperplane is constructed that separates the two classes (this can be extended to the multi-class case).

A description of the SVM algorithm is as follows. Let \( \left\{ {\left( {{\mathbf{x}}_{i} ,y_{i} } \right)} \right\}_{i = 1}^{N} \), \( {\mathbf{x}}_{i} \in R^{d} \), \( y_{i} \in \left\{ { - 1,1} \right\} \) be the training set of input vectors and labels, where N is the number of sample observations, d is the dimension of each observation, and \( y_{i} \) is the known target. The algorithm seeks the hyperplane \( {\mathbf{w}} \cdot {\mathbf{x}}_{i} + b = 0 \), where \( {\mathbf{w}} \) is the normal vector of the hyperplane and b is a bias term, that separates the data of the two classes with maximal margin width \( 2/\left\| {\mathbf{w}} \right\| \); the points lying on the margin boundaries are called support vectors. To find the optimal hyperplane, the SVM solves the following optimization problem [16]:

$$ \begin{aligned} & {\text{Min}}\;\Phi ({\mathbf{w}}) = \frac{1}{2}\left\| {\mathbf{w}} \right\|^{2} \\ & {\text{s.t.}}\;y_{i} ({\mathbf{w}}^{T} {\mathbf{x}}_{i} + b) \ge 1,\quad i = 1,2, \ldots ,N \end{aligned} $$
(1)

Because (1) is difficult to solve directly, the optimization problem is transformed into its dual problem by the Lagrange method, in which the Lagrange multipliers \( \alpha_{i} \) must be nonnegative real coefficients. Problem (1) is thus transformed into the following constrained form:

$$ \begin{aligned} & {\text{Max}}\;\Phi (\alpha ) = \sum\limits_{i = 1}^{N} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{N} {\alpha_{i} \alpha_{j} y_{i} y_{j} {\mathbf{x}}_{i}^{T} {\mathbf{x}}_{j} } } \\ & {\text{s.t.}}\;\sum\limits_{i = 1}^{N} {\alpha_{i} y_{i} } = 0,\quad 0 \le \alpha_{i} \le C,\quad i = 1,2, \ldots ,N \end{aligned} $$
(2)

In (2), C is the penalty factor, which determines the degree of penalty assigned to an error. It can be viewed as a tuning parameter controlling the trade-off between maximizing the margin and minimizing the classification error. In general, a linearly separating hyperplane cannot be found for all data sets. For nonlinearly separable data, the solution is to transform the original data into a higher-dimensional space in which they become linearly separable. This higher-dimensional space is called the feature space, and the transformation improves class separability. Common kernel functions include the linear, polynomial, radial basis function (RBF) and sigmoid kernels. Although several choices are available, the most widely used is the RBF kernel, defined as \( K({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) = \exp ( - \gamma \left\| {{\mathbf{x}}_{i} - {\mathbf{x}}_{j} } \right\|^{2} ) \), \( \gamma > 0 \) [16], where \( \gamma \) controls the width of the RBF. Thus, the RBF kernel is applied in this study. The original SVM was designed for binary classification, and constructing multi-class SVMs is still an ongoing research issue. In this study, we used the multi-class SVM method proposed by Hsu and Lin [17]; for more details, please refer to [17].
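To make the procedure concrete, the following is a minimal sketch of RBF-kernel SVM classification in Python with scikit-learn; the library choice, the placeholder data and the parameter values are illustrative assumptions, not the exact implementation used in this study.

```python
# Minimal sketch: multi-class SVM with an RBF kernel (scikit-learn).
# The data here are random placeholders for illustration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((168, 12))            # e.g., 168 patients, 12 predictors
y = rng.integers(1, 3, size=168)     # 1 = recurrent, 2 = non-recurrent

# C is the penalty factor; gamma is the RBF width parameter.
# SVC handles multiple classes internally via a one-vs-one scheme,
# one of the strategies compared by Hsu and Lin.
clf = SVC(kernel="rbf", C=2.0**13, gamma=2.0**-5)
clf.fit(X, y)
print(clf.predict(X[:5]))            # predicted class labels
```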

2.2 C5.0

The C5.0 classifier is a process for the classification and analysis of information hidden in large data sets/databases, which retrieves useful information in the form of a decision tree, that is, a flowchart-like tree structure [18]. The algorithm adopts a greedy approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner from a training set, using an attribute selection measure. C5.0 improves on C4.5 in several respects: it is faster and more memory efficient, produces smaller decision trees with similar results, supports boosting for higher accuracy, can weight different attributes and is more robust to noise [18, 19]. C4.5, proposed by Quinlan [24], builds decision trees from a set of training data in the same way as ID3 (Iterative Dichotomiser 3), using the concept of information entropy. The training data are a set of already classified samples, each of which is a vector of attributes or features, augmented with the class to which the sample belongs. Each attribute of the data can be used to make a decision [20, 21]. C4.5 examines the normalized information gain that results from choosing an attribute to split the data. The attribute with the highest normalized information gain is used to make the decision, and the algorithm then recurses on the resulting sub-lists.

Take the evaluation of an attribute A as an example, for which the information gain ratio GainRatio(A) is calculated. Let S be a set of samples and \( p_{i} \) the probability that an arbitrary sample belongs to class \( B_{i} \). Suppose the class attribute has n different values, which define n different classes \( B_{i} \) (i = 1, …, n), and let \( S_{i} \) be the number of samples in class \( B_{i} \). Info(S) denotes the information entropy of the current sample set and is calculated as follows:

$$ {\text{Info}}(S) = - \sum\limits_{i = 1}^{n} {p_{i} \log_{2} (p_{i} )} $$
(3)

Suppose attribute A has v different values {\( A_{1} ,A_{2} , \ldots ,A_{v} \)}, which divide S into v subsets {\( S_{1} ,S_{2} , \ldots ,S_{v} \)}, where \( S_{j} \) contains the samples taking value \( A_{j} \) on A, and \( S_{ij} \) is the number of samples of class \( B_{i} \) in subset \( S_{j} \). Info(S, A) is the expected information entropy after splitting S on A.

It is calculated as follows:

$$ {\text{Info}}(S,A) = \sum\limits_{j = 1}^{v} {\frac{{S_{1j} + S_{2j} + \cdots + S_{nj} }}{\left| S \right|}\,{\text{Info}}(S_{j} )} $$
(4)

The split information SplitInfo(A) is the entropy of the partition of S induced by the values of attribute A; it is used to eliminate the bias toward attributes with a large number of distinct values. It is calculated as follows:

$$ {\text{SplitInfo}}\,(A) = - \sum\limits_{j = 1}^{v} {\frac{{\left| {S_{j} } \right|}}{\left| S \right|}\log_{2} \left( {\frac{{\left| {S_{j} } \right|}}{\left| S \right|}} \right)} $$
(5)
$$ {\text{Gain}}\left( A \right) = {\text{Info}}\left( S \right) - {\text{Info}}\left( {S,\,A} \right) $$
(6)
$$ {\text{GainRatio}}\left( A \right) = {\text{Gain}}\left( A \right)/{\text{SplitInfo}}\left( A \right) $$
(7)

2.3 ELM

The extreme learning machine (ELM), proposed by Huang et al. [17], is a learning algorithm for single hidden layer feed-forward neural networks (SLFNs) that randomly selects the input weights and analytically determines the output weights. ELM has been successfully applied to many real-world applications [21, 22, 23, 24]. Not only can it be thousands of times faster than traditional feed-forward network learning algorithms such as the back-propagation (BP) algorithm while obtaining better generalization performance, but it also avoids many of the difficulties faced by gradient-based methods, such as stopping criteria, learning rate, number of learning epochs, local minima and overtraining.

Consider N arbitrary distinct samples \( (x_{i} ,t_{i} ) \), where \( x_{i} = [x_{i1} ,x_{i2} , \ldots ,x_{in} ]^{T} \in R^{n} \) and \( t_{i} = [t_{i1} ,t_{i2} , \ldots ,t_{im} ]^{T} \in R^{m} \). An SLFN with \( \tilde{N} \) hidden neurons and activation function g(x) can approximate these N samples with zero error. This means that

$$ H\beta = T $$
(8)

where

$$ H(w_{1} , \ldots ,w_{{\tilde{N}}} ,b_{1} , \ldots ,b_{{\tilde{N}}} ,x_{1} , \ldots ,x_{N} ) = \left[ {\begin{array}{ccc} {g(w_{1} \cdot x_{1} + b_{1} )} & \cdots & {g(w_{{\tilde{N}}} \cdot x_{1} + b_{{\tilde{N}}} )} \\ \vdots & \ddots & \vdots \\ {g(w_{1} \cdot x_{N} + b_{1} )} & \cdots & {g(w_{{\tilde{N}}} \cdot x_{N} + b_{{\tilde{N}}} )} \\ \end{array} } \right]_{{N \times \tilde{N}}} , $$

\( \beta_{{\tilde{N} \times m}} = (\beta_{1}^{T} , \ldots ,\beta_{{\tilde{N}}}^{T} )^{T} \) and \( T_{N \times m} = (t_{1}^{T} , \ldots ,t_{N}^{T} )^{T} \). Here \( w_{i} = [w_{i1} ,w_{i2} , \ldots ,w_{in} ]^{T} \), \( i = 1,2, \ldots ,\tilde{N} \), is the weight vector connecting the ith hidden node and the input nodes, \( \beta_{i} = [\beta_{i1} ,\beta_{i2} , \ldots ,\beta_{im} ]^{T} \) is the weight vector connecting the ith hidden node and the output nodes, and \( b_{i} \) is the threshold of the ith hidden node. \( w_{i} \cdot x_{j} \) denotes the inner product of \( w_{i} \) and \( x_{j} \). H is called the hidden layer output matrix of the network; the ith column of H is the output of the ith hidden node with respect to inputs \( x_{1} ,x_{2} , \ldots ,x_{N} \).

Thus, the determination of the output weights (linking the hidden layer to the output layer) is as simple as finding the least-square solution to the given linear system. The minimum norm least-square (LS) solution to the linear system [i.e., Eq. (8)] is

$$ \hat{\beta } = H^{\dagger } T $$
(9)

where \( H^{\dagger } \) is the Moore–Penrose generalized inverse of the matrix H. The minimum norm LS solution is unique and has the smallest norm among all LS solutions. The steps of the ELM algorithm can be summarized as follows:

Step 1: Randomly assign input weight \( w_{i} \), and bias \( b_{i} \);

Step 2: Calculate the hidden layer output matrix H;

Step 3: Calculate the output weight \( \beta \): \( \hat{\beta } = H^{\dagger } T \), where \( T = (t_{1} , \ldots ,t_{N} )^{T} \).
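The three steps above translate almost directly into code. Below is a minimal NumPy sketch of ELM training and prediction; the sigmoid activation, the one-hot target encoding and the random placeholder data are assumptions made for illustration, as the paper does not fix g(x) or the target encoding.

```python
# Minimal sketch of the three-step ELM algorithm.
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """Steps 1-3: random (w_i, b_i), hidden output H, beta = pinv(H) @ T."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))  # input weights w_i
    b = rng.standard_normal(n_hidden)                # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # sigmoid g(w_i . x + b_i)
    beta = np.linalg.pinv(H) @ T                     # Moore-Penrose solution, Eq. (9)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)                 # class with largest output

rng = np.random.default_rng(0)
X = rng.random((168, 12))                # placeholder data
y = rng.integers(0, 2, size=168)         # class indices 0/1
T = np.eye(2)[y]                         # one-hot targets
W, b, beta = elm_train(X, T, n_hidden=23, rng=rng)
print((elm_predict(X, W, b, beta) == y).mean())      # training accuracy
```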

3 Empirical study

In this study, a cervical cancer data set provided by the Chung Shan Medical University Hospital Tumor Registry is used to verify the feasibility and effectiveness of C5.0, SVM and ELM. Each patient record contains 12 predictor variables, namely Age, Cell Type, Tumor Grade, Tumor Size, Pathologic T, Pathologic Stage, Surgical Margin Involvement, Lymph Node Metastases (LNM), Number of Fractions of Other RT, RT Target Summary, Sequence of Locoregional Therapy and Systemic Therapy, and Lympho-Vascular Space Involvement (LVSI); the response variable indicates whether or not the cancer recurred. There are 168 patients in total. Among them, 118 records were randomly selected as the training sample (for estimating the parameters of the classification models), preserving the ratio of recurrent to non-recurrent patients (the prior probabilities, or priors), while the remaining 50 were retained as the testing sample (for evaluating the classification capability of the built models). Note that the original data are first scaled into the range [0, 1] before building the C5.0, ELM and SVM classification models. This ensures that input variables with large values do not overwhelm those with smaller values, thus helping to reduce classification errors.
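A minimal sketch of this preprocessing, assuming scikit-learn utilities for the stratified 118/50 split and the [0, 1] scaling (the paper does not name its tooling), is given below.

```python
# Minimal sketch: stratified 118/50 split plus min-max scaling to [0, 1].
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((168, 12))                 # placeholder for the 12 predictors
y = rng.integers(1, 3, size=168)          # 1 = recurrent, 2 = non-recurrent

# stratify=y preserves the recurrent/non-recurrent priors in both samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=118, stratify=y, random_state=0)

scaler = MinMaxScaler()                   # scale inputs into [0, 1]
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)         # reuse the training-set ranges
```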

In the modeling of the C5.0 classification model, the predictor (or independent) variables were first selected. Two significant independent variables were included in the final C5.0 model, namely Pathologic T and RT Target Summary. The classification results (the confusion matrix) on the testing sample using the obtained C5.0 model are summarized in Table 1. From Table 1, the average correct classification rate is 96.0 %, with 0 class 1 patients misclassified as class 2 and 2 class 2 patients misclassified as class 1 (here a class 1 patient is a patient with recurrence, while a class 2 patient is a patient without recurrence).

Table 1 Classification results using C5.0 model

For the SVM classification model, all twelve predictor variables are used as inputs. The performance of the SVM model is mainly affected by the setting of two parameters (C and \( \gamma \)), since the RBF kernel is adopted in this study. There are no general rules for choosing these parameters. In this study, the grid search proposed by Hsu and Lin [14] is used for parameter setting, and it identified (\( C = 2^{13} \), \( \gamma = 2^{-5} \)) as the best parameter set for the SVM model. Note that the model selection details of the three models are omitted to save space.
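For reference, a grid search over powers of two in the style of Hsu and Lin might look as follows; the grid bounds, the 5-fold cross-validation and the placeholder data are our assumptions.

```python
# Minimal sketch: grid search over (C, gamma) in powers of two.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((118, 12))          # placeholder training data
y_train = rng.integers(1, 3, size=118)

param_grid = {
    "C": [2.0**k for k in range(-5, 16, 2)],      # 2^-5 ... 2^15
    "gamma": [2.0**k for k in range(-15, 4, 2)],  # 2^-15 ... 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)               # best (C, gamma) on this data
```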

The classification results on the testing sample using the obtained SVM model are summarized in Table 2. From Table 2, the average correct classification rate is 68.00 %, with 2 class 1 patients misclassified as class 2 and 14 class 2 patients misclassified as class 1.

Table 2 Classification results using SVM model

In the modeling of the ELM model, the input layer has twelve nodes, as twelve independent variables are used. The most critical parameter of ELM is the number of hidden nodes, and ELM tends to be unstable in a single run. Therefore, ELM models with the number of hidden nodes varying from 1 to 30 were constructed. For each number of nodes, the ELM model was run 30 times, and the number of hidden nodes giving the smallest testing RMSE was selected. In this study, the ELM model with 23 hidden nodes yielded the smallest RMSE and was therefore chosen as the best ELM model.
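The selection procedure can be sketched as follows, reusing the elm_train and elm_predict functions from the sketch in Sect. 2.3; the error measure shown here is the misclassification rate rather than RMSE, a simplification on our part.

```python
# Minimal sketch: choose the number of hidden nodes over 30 repeated runs.
# Assumes elm_train/elm_predict from the earlier ELM sketch, plus
# X_train, T_train (one-hot), X_test, y_test (class indices 0/1)
# from the 118/50 split above.
import numpy as np

best_nodes, best_err = None, np.inf
for n_hidden in range(1, 31):                  # 1 to 30 hidden nodes
    errs = []
    for run in range(30):                      # 30 repetitions per setting
        rng = np.random.default_rng(run)
        W, b, beta = elm_train(X_train, T_train, n_hidden, rng)
        pred = elm_predict(X_test, W, b, beta)
        errs.append((pred != y_test).mean())   # misclassification rate
    if np.mean(errs) < best_err:
        best_nodes, best_err = n_hidden, float(np.mean(errs))
print(best_nodes)                              # 23 in the study
```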

Table 3 shows the classification results of the ELM model. The average correct classification rate is 94.00 %, with 3 class 1 patients misclassified as class 2 and 0 class 2 patients misclassified as class 1.

Table 3 Classification results using ELM model

From Tables 1, 2 and 3, it can be seen that the average correct classification rates of the C5.0, SVM and ELM models are 96.00, 68.00 and 94.00 %, respectively. The C5.0 model has the best classification capability in terms of the average correct classification rate; it outperforms the SVM and ELM models and hence provides an efficient alternative for cervical cancer classification tasks.

To assess the robustness of the C5.0 method, the performance of the C5.0, SVM and ELM models was tested over 10 independent runs. Based on the findings in Table 4, the ELM model has the highest average correct classification rate, 93.14 %, for {1-1} (a patient with recurrence classified as recurrent), while the C5.0 model has the highest average correct classification rate, 91.27 %, for {2-2} (a patient without recurrence classified as non-recurrent). The highest overall average correct classification rate, 92.44 %, is achieved by the C5.0 model. Since the C5.0 model outperforms the SVM and ELM models in the {2-2} and overall cases, it provides better classification accuracy than the other two approaches. Consequently, based on the results from this data set, we conclude that the C5.0 model is an effective alternative for cervical cancer classification.

Table 4 Robustness evaluation of the C5.0, SVM, ELM models
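As a complement to Table 4, the per-class rates can be computed from predictions as in the following sketch; the {1-1} and {2-2} notation follows the text above, and the helper name is ours.

```python
# Minimal sketch: per-class and overall correct classification rates.
import numpy as np

def class_rates(y_true, y_pred):
    """Return ({1-1}, {2-2}, overall) correct-classification rates."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    r11 = (y_pred[y_true == 1] == 1).mean()   # recurrent classified recurrent
    r22 = (y_pred[y_true == 2] == 2).mean()   # non-recurrent classified non-recurrent
    overall = (y_pred == y_true).mean()
    return r11, r22, overall

# Toy example: rates (0.5, 0.667, 0.6) for these five patients.
print(class_rates([1, 1, 2, 2, 2], [1, 2, 2, 2, 1]))
```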

In this study, the C5.0 model not only generates the best classification results, but can also be used to select important independent variables for cervical cancer classification, which can provide useful information for cervical cancer treatment. After 10 runs, the selected important independent variables were Pathologic Stage, Pathologic T, Cell Type and RT Target Summary.

4 Conclusion

Cervical cancer can remain hidden for a long time, so it would be helpful if the doctor could predict recurrence on the basis of a few variables shortly after treatment. In recurrent cervical cancer, the most important question is why a clinical decision was made, and the doctor is always expected to make the right one. Years of clinical study and experience are needed to select the correct answer, yet mistakes are always possible. To better predict the prognosis, many investigators have tried to identify risk factors for recurrence, including tumor size, lymphovascular space involvement, depth of tumor invasion and lymph node metastasis. These factors, however, are interrelated, and analyses based on them do not always reflect the true prognosis. The problem we investigate in this study is how to support a physician's decision about which variables warrant attention. In our study, Pathologic Stage and Pathologic T were independent risk factors, in contrast to other similar analyses [7, 8].

As a result, our findings support Pathologic Stage and Pathologic T as important, independent prognostic factors; in particular, Cell Type and RT Target Summary were significantly related to recurrence. Because of the small number of patients studied, the outcomes of different treatment modalities in patients with LVSI and deeply invasive tumors without LNM could not be further analyzed. The presented results suggest that the decision tree is a good decision model; for medical interpretation, however, further clinical cooperation with doctors is needed to verify the proposed model. Perhaps by analyzing existing or easily measured patient data, we can develop results by which a physician caring for a patient can better decide when to undertake a critical intervention.