Prognostic information has always been important to patients and clinicians, but with the advent of a greater range of treatment options for patients, prediction has become increasingly important as well. This is particularly true with ovarian cancer, which is an extremely heterogeneous disease. Patients present at various stages of the disease, and the tumors not only are of varying grades but also display a range of histologic subtypes.

Currently, outside of clinical trials, patients are treated with surgical cytoreduction followed by platinum-based chemotherapy.1 Whereas age,2 stage,3 grade,4 histologic type,5 preoperative CA125 levels,6 and result of surgery7 all have been associated with overall survival, multivariate analysis consistently shows only stage and result of surgery to be independent prognostic factors.8

However, studies have shown new treatment options including antiangiogenic treatment,9 poly(ADP-ribose) polymerase (PARP) inhibitor therapy,10 and hormonal therapy11 to be effective for at least some subgroups of patients with ovarian cancer. Although biomarkers are being developed to identify patients likely to respond to each of these treatments,1214 it is unclear how these treatments should be scheduled and prioritized. Furthermore, although the role of surgery in ovarian cancer is well established, it is clear that not all patients benefit from surgery, and for some patients, the morbidity of the procedure outweighs any potential benefit. It is therefore important to develop more powerful prognostic and predictive tools that allow patients and clinicians to plan treatment both optimally and realistically.

Artificial intelligence (AI) systems are in common everyday use and provide many benefits including ability to handle enormous amounts of data, to cope with missing data items, and to evolve in the presence of new data. In general, they attempt to find better solutions rather than the absolutely best or correct solution.15

Simplistically, AI systems comprise an algorithm and a classifier. Several different algorithms and classifiers have been developed, all of which have relative merits. Further information regarding these is given in the supplementary materials.

This study therefore investigated the role of AI systems in creating a prognostic and predictive tool for ovarian cancer. We generated a historical data set from patients treated at our institution and extracted data that would be readily available after initial surgery and staging, mimicking the clinical situation, as well as overall survival data. We then used a systems-based approach to investigate the optimal method for determining a range of outcomes.

Methods

The Data Set

The Newcastle ovarian cancer tissue and data set has been described previously.16 Briefly, the data set comprises 668 cases of epithelial ovarian cancer. All patients underwent maximal-effort primary cytoreductive surgery followed by platinum-based chemotherapy either with or without paclitaxel between 1995 and 2005 at the Northern Gynaecological Oncology Centre, Gateshead, UK. No patient received intraperitoneal chemotherapy during the study period.

For each tumor sample, all histology reports and patient records were retrieved wherever possible, and data were extracted after ethical approval. Data were extracted from the patient record, including age, International Federation of Gynecology and Obstetrics (FIGO) stage, grade, histologic subtype, preoperative CA125, and outcome of surgery (Table 1, supplementary materials). All these data items are available to a clinician immediately after staging surgery. In fact, given the increased use of pretreatment biopsy and the routine use of cross-sectional imaging, these data items all can be provided to a clinician before any treatment is mandated. These data items were termed “features” for the purpose of this study. Although other features could have been considered for inclusion in the study, the aforementioned list was considered the minimum data set that would be available to all clinicians.

An important trend during the study period was the increase in complete cytoreduction rates from 30 % in 1995 to 55 % in 2005. Overall survival (all-cause mortality) also was recorded as an end point. Median survival increased from 25 months in 1995 to 35 months in 2005.

Generation of the AI System

A system then was created to generate overall patient survival as the output. The system was developed on a Java-based platform capable of analyzing multiple parameters using any of three algorithms [decision tree (DT), artificial neural network (ANN), and Bayesian network (BN)] and any of three classifiers [support vector machines (SVM), naïve Bayes (NB), and K-nearest neighbor (KNN)] (for further information, see the supplementary materials).

The system was thus capable of making multiple comparisons between different combinations of algorithms and classifiers using the data set as described earlier. Each analysis was stored and compared with the percentage performance of each technique on the appropriate data sets and sub-data sets.

Experimental Methodology

To compare different machine learning techniques and data sets, we first generated a process definition (Fig. 1). Briefly, the data set was first randomized. The initial training and evaluation of the model started with 190 cases randomly selected from the data set. Sequentially, 18 new cases then were added to the data set until all 668 available cases were included. This process allowed investigation of the model’s performance by increasing the number of data items and allowed discovery of the optimal stop-training point. Ultimately, this means that for each outcome investigated, a total of 7668 analyses were performed.

Fig. 1
figure 1

Process definition algorithm. From the total cohort, 190 cases were selected at random and used as the test set. The remaining cases then were used as a validation set

Each interval of the process contained both a training process and an evaluation process. The tenfold cross-validation was used to estimate the performance prediction.17

Survival was used as the end point in all analyses. The survival period of the patients in the data set ranged from 1 to 116 months. In each analysis, the median survival period for that particular cohort was calculated, and the accuracy of the system was calculated by the ability of the system to predict survival more or less than the median for that cohort. Patients for whom follow-up data were not available to at least the median time point were excluded from the analysis. Other performance indicators used were accuracy, recall, mean absolute error (MAE), root mean squared error (RMSE), and area under the curve (AUC), reviewed in the study of Cort and Kenji.18

Results

Artificial Neural Networks Have the Highest Accuracy for Prognosis

Ability to Predict More or Less Than Median Survival. The model was used to predict survival, dichotomized around the median survival of the cohort. As outlined earlier, the model was tested initially with a random set of 190 cases and validated with the remainder. In this test, ANN outperformed the Bayesian networks and the decision tree, which had only limited predictive capacity (Fig. 2), and gave an overall predictive accuracy of 89 %. The AUC for the receiver operating characteristic curve (ROC) was 0.74, suggesting good prediction, and the MAE and RMSE had low scores of 0.16 and 0.30, respectively. Thereafter, all analyses were performed using ANNs.

Fig. 2
figure 2

Receiver operating characteristic (ROC) curve showing performance of model in predicting survival above the median using three different algorithms: Bayesian networks (BN), decision-tree analysis (DT), and artificial neural network (ANN). The latter clearly outperforms both BN and DT. Area under the curve (AUC) is 0.74

When the number of cases in the validation set was increased in steps of 18 cases, the accuracy of the system also improved slightly to 92 % when the maximum 658 cases were used for test cases and the remaining 10 were used as validation cases.

Ability to Determine Prognosis fo r Smaller and Smaller Groups. The ability to determine survival to more or less than a single time point is of limited clinical value. We therefore wanted to investigate the performance of the system to determine prognosis in increasing numbers of categories. We therefore divided the cohort into quartiles and then into 8, 16, 32, and ultimately, 64 categories.

As expected, the accuracy of the model decreased, whereas the MAE and RMSE increased in relation to prediction of survival with increasing numbers of categories from 89 % for 2 categories to just 6 % for 64 categories (Fig. 3).

Fig. 3
figure 3

a As the number of prediction categories was increased from 2 to 64, accuracy of the model decreased, although error as measured by b mean absolute error (MAE) and c root mean squared error (RMSE) was not aversely affected

Ability to Predict 5-Year Survival. Overall survival can be defined in terms of median survival or 5-year survival rates. To allow comparison of our model against other published data, we therefore calculated the accuracy of our system in predicting 5-year (60-month) survival. The accuracy of the model in predicting 5-year survival was 93 %. The error rate values were relatively small, with an MAE of 0.16 and RMSE of 0.30. The area under the ROC curve (AUC) of 0.74 indicated that the model performed well.

AI Can be Used to Provide Predictive Information Regarding Surgical Resection

In addition to use of the model as a prognostic tool, we tested its ability to predict outcome. We chose to evaluate the ability of the system to test outcome of surgery because this is arguably one of the most important treatment decisions needing to be made with a patient.19 The features used were age, stage, grade, histologic type, and preoperative Ca125. All these features would be available to a surgeon after a biopsy but before performance of a laparotomy.

The accuracy of the model to predict complete, optimal, or suboptimal cytoreduction was 59.75 %, with an MAE value of 0.34 and an RMSE value of 0.42. The AUC for complete cytoreduction curve was 0.75. The AUC for optimal cytoreduction was 0.66, and the AUC for suboptimal cytoreduction was 0.71 (Fig. 4).

Fig. 4
figure 4

Receiver operating characteristic (ROC) curve demonstrating ability of the model to predict the outcome of surgery using an artificial neural network (ANN) algorithm. The model was used to predict a complete [area under the curve (AUC), 0.75], b optimal cytoreduction (AUC, 0.66), and c suboptimal cytoreduction (AUC, 0.71)

The question then was further refined to give a prediction of “complete and optimal” as the one category and “suboptimal” as the other category. In this situation, the performance of the model improved to give an accuracy of 77.7 % associated with an MAE of 0.29 and an RMSE of 0.38.

Comparison of the AI System with Conventional Statistics

To have a complete comparison between models, logistic regression models as standard conventional statistical prediction models were developed. The results of the logistic regression models then were compared against the ANN results.

Although Cox proportional hazards models are more commonly used for survival analysis and can be applied to predict survival probabilities, such models are better suited for making comparisons between two different treatment arms (comparing the hazards between each). Given that this study aimed to develop prediction for an individual patient, logistic regression models were more appropriate.

The model, using ANN as the algorithm, consistently outperformed logistic regression in its ability to predict survival more or less than the median for each cohort. In 5998 (75 %) of 7998 analyses, ANN had a greater accuracy than logistic regression (Fig. 5).

Fig. 5
figure 5

Receiver operating characteristic (ROC) curve demonstrating prediction of survival less than the median survival of the cohort comparing artificial neural network (ANN) with logistic regression (LR). The ANN is associated with better prediction [area under the curve (AUC), 0.72] than LR (AUC, 0.62)

Using AUC to make a direct comparison, the AUC for logistic regression was 0.62 compared with the AUC of 0.72 for ANN. Compared with logistic regression, ANN also was associated with lower MAE and RMSE scores.

Discussion

This report describes the development of an AI model capable of providing prognostic and predictive information in ovarian cancer. The model uses readily available data items as input variables. We have measured success of the model in predicting overall survival, which is the most clinically important outcome measure. Thus, this tool may have direct clinical relevance for patients and their treating physicians. It also may be of use in the research setting in which reliable outcome data must be obtained without long follow-up periods.

With the system optimized, we are able to predict survival for patients in our cohort with an accuracy of 93 % compared with the median survival of the group (AUC, 0.72). Clearly, the ability to predict whether a patient will survive to more or less than a single time point has limited clinical use, but given the size of this data set, we believe that these are encouraging results. The ability of the system to predict survival with more outcome categories decreased, but even when 64 outcome categories were used (which is almost equivalent to predicting survival by month), the system could accurately predict survival in 6 % of cases.

We also used the model to provide predictive information and chose outcome of surgery (the ability to identify patients likely to end up with suboptimal cytoreduction) as an important clinical question.19 To date, conventional predictive systems have had limited accuracy.20 Although our model also had only limited accuracy (77 %), this was based on data easily available to all clinicians and did not take into account any findings from radiologic investigations or comorbidities of the patient. Such features can easily be incorporated into an AI system, and future investigations will examine whether incorporating such features improves the accuracy of the model further. These results are consistent with those of Jefferson et al.,21 who showed that neural networks can outperform logistic regression in predicting surgical outcome in lung cancer.

Our results suggest that the model described in this report is a better predictor of outcome than a system based on conventional statistics. Two systems based on conventional statistics have been described for use in ovarian cancer: Gerestein et al.22 developed a nomogram using three or five parameters that has a c statistic of 0.67, whereas Teramukai et al.23 developed a similar nomogram, which has subsequently been validated in a UK-based population of patients.24 Both of these models are finite (i.e., their performance is fixed and will not improve with time). In contrast, a systems-based approach, such as the one described in this report, can be iterative, and performance is likely to improve as the data set enlarges. In this report, we demonstrated modest increases in performance as the number of cases was increased, which supports this concept. It also is likely that as the number of cases in the data set increases, the relative weightings of some of the input parameters will change. Thus, we will continue to input as many data items as are available and monitor the performance of the system to test this further.

Ovarian cancer is an extremely heterogeneous disease with a variety of different histologic subtypes and a wide range of responses to treatment. By developing a model in this disease, we are confident that similar models also can be developed in more homogeneous cancers. Artificial intelligence systems also have other significant advantages over conventional models. In particular, they can cope with missing data items and will incorporate algorithms to overcome this (see supplementary data). Their greatest advantage, however, is that they are iterative, and performance improves as the data sets increase in size and relevance. Therefore, although this model has been tested in only one center, it could be adapted for use in multiple centers but would need to “learn” how to predict in each center given that the outcomes (e.g., for surgery) are likely to be different.

Current research in ovarian cancer is focused on developing new treatments, many of which will be biomarker driven (i.e., a test will be available that will dictate whether the patient is likely to respond to the treatment in question). Such a model is already standard care in breast cancer, with a test that identifies the presence of the HER2 receptor and dictates whether a patient receives herceptin treatment.

Therefore, within a relatively short time, we likely will have a multitude of new treatments available, all associated with a particular biomarker test. The challenge for the clinician will be to prioritize these treatments and determine the optimal treatment strategy. Because conventional algorithms probably will not be capable of handling such complex interactions as are likely to exist, the development of AI systems will become paramount. We hope that the system described in this report can be the basis for the development of these more complex models.

Future work will include increasing the size of the data set to increase the performance of the system and validating it further on other external data sets of patients.