1 Introduction

While infectious diseases continue to pose a threat to world health, in the words of the World Health Organization “it is the looming epidemics of heart disease, stroke, cancer and other chronic diseases that for the foreseeable future will take the greatest toll in deaths and disability” [1]. In fact, already 10 years ago the total number of people dying from chronic diseases was double that of all infectious diseases, maternal/perinatal conditions, and nutritional deficiencies combined [1]. The rise of these conditions can be traced to a complex web of interactions between common factors, such as genes, nutrition and lifestyle, and socio-economic status.

Since chronic conditions are very costly but also largely preventable, there is great interest in building models that make it possible to simulate the costs and benefits of health interventions in this area, and that can be used for planning and policy purposes by government agencies and other interested stakeholders [2–4].

The risk of developing a chronic condition depends strongly on risk factors such as obesity or smoking and on individual characteristics such as income and education. These factors vary greatly within the population, and it is therefore particularly important to develop models that predict the onset of chronic conditions at the individual level, which can then be used as components of simulation models applied to an entire population [2].

Since chronic conditions are strongly correlated (for example, diabetes and heart disease often go together), it is imperative to use models that make joint predictions, rather than modeling each condition separately. In the biostatistics literature this is usually done using multivariate probit (MVP) models [5, 6]. While MVP models are attractive because they are easily interpretable, they rely on a simple and rather restrictive specification, and they were designed more for understanding the determinants of the outcomes than for predicting the future.

From a machine learning viewpoint it is somewhat surprising that there have been no attempts to use more sophisticated, and more appropriate, types of models, such as Support Vector Machines (SVMs) or Deep Learning (DL) methods. In this paper we start to fill this gap by presenting a comparison of the predictive ability of MVPs and SVMs. We chose to start with SVMs mostly because the biostatistics community is very comfortable with R, and at the moment R offers somewhat more support for SVMs than for DL.

It is important to underscore that improving the accuracy of MVP predictions is not an academic exercise. What is of interest to policy makers are long-term predictions (20 to 30 years), which can only be made by repeatedly applying shorter-term predictions (from one to three years, depending on the availability of longitudinal data). Therefore even a small improvement in the accuracy of short-term predictions can result in a large reduction in the uncertainty of the long-term estimates, with a correspondingly large impact on policy outcomes.

The rest of this article is organized as follows. Section 2 describes the data used in our experiment. Section 3 briefly describes the MVP and SVM models. Section 4 discusses the experimental results and Sect. 5 concludes the paper.

2 Data

In order to build a predictive model of chronic disease it is necessary to have longitudinal data, in which the same individual has been observed at least twice. Since we are interested in predicting several chronic conditions at once, and since the joint prevalence of certain conditions is not very high, the data set needs to be quite large in order to capture some of those combinations. There is a dearth of longitudinal data that can be used for this purpose; one of the largest sources is the Australian Social, Economic, and Environmental Factors (SEEF) study, a follow-up to the 45 and Up Study survey [7]. The approval for this study was provided by the NSW Population & Health Services Research Ethics Committee (AU RED reference: HREC/15/CIPHS/4).

The 45 and Up Study survey (www.saxinstitute.org.au), which was carried out between 2006 and 2009, contains information regarding the health and social wellbeing of 267,153 individuals aged 45 years and older living in New South Wales (NSW), Australia. Eligible individuals, sampled from the Medicare population of NSW, were mailed the questionnaire, an information sheet and a consent form, and were provided with a reply-paid envelope. The survey over-sampled individuals aged 80 years and over and residents of rural areas by a factor of two. In addition, all residents aged 45 years and older in remote areas were sampled. The overall response rate of the 45 and Up Study is 18 %, accounting for approximately 10 % of all individuals aged 45 years or older living in NSW. While the response rate is low and participants tended to be in more favorable socioeconomic circumstances than average for the age group, previous work has shown that analytical findings based on internal comparisons, such as odds ratios, are generalizable and comparable to those derived from smaller but more representative population health surveillance studies [8].

Data captured in the 45 and Up Study baseline include a number of self-reported chronic conditions such as (ever diagnosed) heart disease, high blood pressure, diabetes, stroke, asthma, depression and different types of cancer.

Questionnaire data also include information on key potential confounding and mediating factors, including age, sex, household income, level of education, smoking history, alcohol use, physical activity, height and weight, functional status, psychological distress, medical and surgical history, and dietary habits. A full description of all the variables available in the 45 and Up Study, together with basic summary statistics, can be found elsewhere [7].

The SEEF study data, which include all the original variables in the 45 and Up Study plus a host of additional variables, were collected in 2010 from a random subsample of the baseline 45 and Up Study cohort. One hundred thousand 45 and Up Study participants were mailed an invitation and the SEEF questionnaire. About 60,000 individuals joined the SEEF study by completing the consent form and the questionnaire and mailing them to the study coordinating center.

Our dependent variables are 5 binary variables denoting the presence or absence of the following chronic conditions at follow-up: heart disease, hypertension, diabetes, stroke, and cancer. These health conditions were self-reported and based on the responses to survey questions formulated as follows: “Has a doctor ever told you that you have [name of condition]?”.

Since individuals can develop any of those five conditions, we consider the multi-class problem of predicting in which of the \(2^5=32\) combinations of conditions individuals will fall at follow-up. We report in Table 1 the size of each of the 32 classes in the SEEF data. Since some of the classes are very small, and neither of the two methods outperformed the other in those cases, we have eliminated from our data the classes with fewer than 100 cases (outlined in bold in Table 1).

Table 1. Class size (bold font shows the classes which we removed)
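
For concreteness, this encoding and filtering step can be expressed in a few lines of R; the data frame and column names below are illustrative placeholders, not the actual SEEF variable names.

```r
# Encode the 2^5 = 32 combinations of the five follow-up conditions as a
# single class label, then drop the rare classes (fewer than 100 cases).
conditions <- c("heart", "hypertension", "diabetes", "stroke", "cancer")

# Binary encoding: each 0/1 pattern maps to an integer in 0..31
# ("heart" is the least significant bit).
dat$class <- as.integer(as.matrix(dat[, conditions]) %*% 2^(0:4))

class_sizes <- table(dat$class)
keep <- names(class_sizes)[class_sizes >= 100]
dat <- dat[dat$class %in% keep, ]
```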

The two main risk factors that we used as covariates were obesity and smoking status. Possible values of smoking status are “Not smoking”, “Smoker”, and “Quit smoking”, which are derived from the combined answers to the following two questions: “Have you ever been a regular smoker?” and “Are you a regular smoker now?”.

Obesity status was based on the value of the body mass index (BMI), which is the body weight in kilograms divided by the square of the body height in meters. We used the standard World Health Organization classification system to categorize individuals as Underweight (BMI \(< 18.5\)), Normal (\(18.5\le \,\)BMI\(\,< 25\)), Overweight (\(25\le \,\)BMI\(\,< 30\)) and Obese (BMI \(\ge 30\)).
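
A minimal R sketch of this covariate coding (the raw column names are hypothetical):

```r
# Derive smoking status from the two survey questions (hypothetical 0/1 columns).
dat$smoking <- ifelse(dat$regular_smoker_now == 1, "Smoker",
                      ifelse(dat$ever_regular_smoker == 1, "Quit smoking",
                             "Not smoking"))

# WHO BMI categories; right = FALSE gives intervals of the form [a, b),
# matching the cut-offs in the text.
dat$bmi <- dat$weight_kg / dat$height_m^2
dat$bmi_cat <- cut(dat$bmi,
                   breaks = c(-Inf, 18.5, 25, 30, Inf),
                   labels = c("Underweight", "Normal", "Overweight", "Obese"),
                   right = FALSE)
```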

Additional covariates used in the analysis are the five chronic conditions at baseline, age category, gender, income, work status, private health insurance status, BMI and smoking status.

The SEEF study includes many more variables (such as education, dietary habits or family history) that could be used in the analysis but we have restricted ourselves to this set because we found that adding more variables did not significantly improve the predictions.

Since individuals were recruited into the 45 and Up Study over a period of a few years, the interval between interviews is not always the same, resulting in follow-up data being collected between 2 and 4 years after baseline. Therefore we also included the time to follow-up as a covariate, which on average was two and a half years. The summary statistics for the covariates used in the model are shown in Table 2.

Table 2. Summary statistics of the SEEF Study. All quantities measured at baseline except when reported otherwise. Quantities in parentheses are proportions.

3 Methodology

3.1 Multivariate Probit

Let us denote by \(Y_{i\alpha }^{(1)}\) the binary variable indicating the presence at follow-up of chronic condition \(\alpha \) for individual i, where \(i = 1, 2, \dots , N\) with \(N=60,000\), and \(\alpha \in \{\text {heart disease},\text {diabetes},\text {hypertension},\text {stroke},\text {cancer}\}\). Let us also denote by \(Y_{i\alpha }^{(0)}\) the corresponding variable measured at baseline, and by \(\mathbf{Z}_i \in R^d\) a vector of other covariates measured at baseline. To simplify the notation we denote by \(\mathbf{Y}_{i}^{(1)}\) (\(\mathbf{Y}_{i}^{(0)}\)) the vectors whose components are \(Y_{i\alpha }^{(1)}\) (\(Y_{i\alpha }^{(0)}\)).

The MVP model is a latent variable model with the following specification:

$$\begin{aligned} \hat{\mathbf{Y}}_i^{(1)}&= \varGamma \mathbf{Y}_i^{(0)} + \varTheta \mathbf{Z}_i + \varvec{\epsilon }_i, \qquad \varvec{\epsilon }_i \sim \mathcal{N}(0, \varSigma ) \\ Y_{i\alpha }^{(1)}&= 1 ~\text {if}~ \hat{Y}_{i\alpha }^{(1)} > 0, \quad 0 ~\text {otherwise} \end{aligned}$$
(1)

where \(\varGamma \) and \(\varTheta \) are matrices of coefficients, of dimensions \(5 \times 5\) and \(5 \times d\) respectively, that need to be estimated. The key to the MVP model of Eq. 1 is the presence of the \(5 \times 5\) (unknown) covariance matrix \(\varSigma \). The off-diagonal elements of its inverse capture the correlations across chronic conditions, that is, the fact that developing, say, heart disease and diabetes are not independent events. Predictions from the MVP model are performed probabilistically, by feeding samples of the multivariate normal distribution \(\mathcal{N}(0, \varSigma )\), one for each individual, into Eq. 1.

The estimation of the full MVP model is notoriously computationally intensive, although recent advances in computational methods [6] make it much more tractable. For the purpose of our experiments we developed an approximation of the traditional method in which we use the observed correlations among chronic conditions to approximate the matrix \(\varSigma \), which makes the estimation of the model much simpler. Since we observed no deterioration in performance with the approximate method, all the experiments in this paper were performed using the approximation rather than the full implementation.
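
A minimal R sketch of this approximation (an illustration of the idea, not the exact code used in our experiments): each margin is fit as a univariate probit, \(\varSigma \) is approximated by the Pearson correlation of the observed binary outcomes, and predictions follow Eq. 1 by drawing one error vector per individual. All variable names are hypothetical.

```r
library(MASS)  # for mvrnorm

conditions <- c("heart", "hypertension", "diabetes", "stroke", "cancer")
covariates <- c("age_cat", "sex", "income", "bmi_cat", "smoking",
                paste0(conditions, "_baseline"))  # hypothetical predictor names

# One univariate probit regression per follow-up condition.
fits <- lapply(conditions, function(y) {
  glm(reformulate(covariates, response = paste0(y, "_followup")),
      family = binomial(link = "probit"), data = train)
})

# Approximate Sigma with the observed correlation of the binary outcomes.
Sigma <- cor(train[, paste0(conditions, "_followup")])

# Probabilistic prediction: latent index = linear predictor + correlated noise,
# thresholded at zero as in Eq. 1.
lp   <- sapply(fits, predict, newdata = test, type = "link")  # N x 5 matrix
eps  <- mvrnorm(nrow(test), mu = rep(0, 5), Sigma = Sigma)
pred <- (lp + eps > 0) * 1
```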

3.2 Support Vector Machines

Support Vector Machines (SVMs) have been part of the machine learning toolbox for more than 20 years now [9], and for the sake of brevity we simply refer the reader to standard textbooks and references [10, 11]. SVMs have many attractive features, but one that should be emphasized in the context of this paper is that, unlike MVPs, they do not rely on distributional assumptions regarding the process that generates the data. Instead, SVMs rely on two key modeling choices:

  1. the parameter (usually denoted by C) that controls the penalty associated with the misclassification of a data point;

  2. the kernel, which is associated with the choice of the (possibly infinite-dimensional) feature space onto which the input variables are projected [12]. For the purpose of this paper we have mainly experimented with polynomial kernels of the form \(K(\mathbf{x}_i,\mathbf{x}_j)=(1+\mathbf{x}_i \cdot \mathbf{x}_j)^p\), which are uniquely parametrized by the degree p.

SVMs were originally designed for binary classification problems, but several extensions exist that handle multi-class problems.

In the R package we use for the SVM implementation, kernlab, there are several options for dealing with multi-class problems [13, 14]. We found that for this problem the best results were obtained with the “one vs one” approach, in which one trains \(K(K-1)/2\) binary classifiers (with \(K = 32\) in our case). Each classifier separates one class from another, and in order to classify a new sample all classifiers are applied and the class that receives the highest number of votes is selected. While it is not fully clear why the “one vs one” approach worked better than the alternatives (such as the “one vs all” approach [15]), the rarity of many of the events we are trying to predict seems to play a role, since with rare classes the “one vs all” approach leads to very imbalanced binary training sets, which the pairwise approach largely avoids.
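
A minimal kernlab sketch of this setup (the hyperparameter values shown are illustrative, not our tuned ones; train is assumed to hold the class label plus the covariates):

```r
library(kernlab)

train$class <- factor(train$class)  # the response must be a factor for classification

# C-svc with more than two classes uses pairwise ("one vs one") voting in kernlab.
fit <- ksvm(class ~ ., data = train,
            type   = "C-svc",
            kernel = "polydot",
            kpar   = list(degree = 2, scale = 1, offset = 1),  # K(x, x') = (1 + x.x')^p
            C      = 1)

pred <- predict(fit, test)
```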

4 Experimental Results

4.1 Performance Evaluation Metrics

We used a 10-fold cross-validation approach to estimate the performance of the MVP and SVM methods. The full data set was first randomly partitioned into 10 subsets of equal size (approximately 6,000 data points each). For each of the 10 replication trials we withheld one of the 10 partitions for testing, while the remaining 9 partitions were used for training. For each of the 10 trials we computed 4 performance measures, and we report the average of the performance measures over the 10 replications.
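
The partitioning can be sketched in R as follows (the seed and variable names are illustrative):

```r
set.seed(1)  # illustrative seed
k <- 10
fold <- sample(rep(1:k, length.out = nrow(dat)))  # random fold assignment

for (i in 1:k) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  # ... fit MVP and MSVM on train, evaluate on test, store the four measures ...
}
```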

As performance measures we report sensitivity and specificity, since they are the ones most commonly used in health studies, as well as accuracy and the F1 score. We report the definitions below, where TP, TN, FP and FN refer to the total number of true positives, true negatives, false positives and false negatives respectively.

$$\begin{aligned} \text {Sensitivity} = \frac{TP}{TP+FN} \end{aligned}$$
(2)

Sensitivity (or true positive rate, or recall) is important because it measures the ability to identify who is going to develop the disease.

$$\begin{aligned} \text {Specificity} = \frac{TN}{TN + FP} \end{aligned}$$
(3)

Specificity (or true negative rate) is important because it measures the ability to identify who is not going to develop the disease.

$$\begin{aligned} \text {Accuracy} = \frac{TP+TN}{TP+FP+TN+FN} \end{aligned}$$
(4)

Accuracy indicates how many samples are correctly classified overall. It can be misleading when the dataset is imbalanced, and therefore we also report an alternative performance measure, the F1 score, defined as:

$$\begin{aligned} \text {F1 Score} = 2\times \frac{pr}{p+r} \end{aligned}$$
(5)

where p is the precision and r is the recall (or sensitivity). Here precision is defined as the ratio of true positives (TP) to all predicted positives (TP + FP). Since the F1 score is the harmonic mean of precision and recall, a high score is obtained only when precision and recall are both high.
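
For concreteness, the four measures can be computed for each class from one-vs-rest counts, as in the following R sketch:

```r
# Per-class metrics from predicted and true labels, following Eqs. 2-5.
class_metrics <- function(pred, truth, cls) {
  tp <- sum(pred == cls & truth == cls)
  tn <- sum(pred != cls & truth != cls)
  fp <- sum(pred == cls & truth != cls)
  fn <- sum(pred != cls & truth == cls)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)  # sensitivity
  c(sensitivity = recall,
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    f1          = 2 * precision * recall / (precision + recall))
}
```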

4.2 Results

The averages of the performance measures over the 10 replication sets for multi-class SVMs (MSVMs) and MVP are shown in Figs. 1 and 2. In Fig. 1 we report specificity and sensitivity for both methods. The key message of this figure is that while the specificities of the two methods are comparable, the sensitivity of MSVM is, on average, about 12 percentage points higher than that of the multivariate probit. Since sensitivities are in general not very high, this translates into a large relative improvement of approximately 30 %.

Fig. 1. Comparison between MSVM and MVP using 10-fold cross-validation: sensitivity and specificity.

A similar pattern is seen in the accuracy and F1 scores. With very few exceptions SVMs are more accurate than MVPs, although not by much. That the difference is not large reflects the fact that in most cases the classification problem is quite imbalanced, a setting in which accuracy is not a good performance measure. The F1 score shows larger differences between SVMs and MVPs, which is not surprising, since a component of the F1 score is the sensitivity of the method, which is greatly improved using MSVMs.

Fig. 2. Comparison between MSVM and MVP using 10-fold cross-validation: accuracy and F1 score.

5 Lessons Learned

A few lessons have emerged from this study. First of all, independently of which method we use, predicting who is going to develop some combination of chronic conditions in the near future, based on a handful of individual characteristics and the current chronic conditions, is quite hard. While specificity rates remained above 90 %, most of the sensitivity rates obtained using MSVMs fell between 50 % and 75 %.

In our experience, including additional risk factors, such as diet or family history, will only lead to marginal improvements. What is likely to have a major impact on the predictive ability of any method is a more accurate measurement of people’s health status, such as actual results of pathology and imaging tests. Unfortunately it seems unlikely that data sets of this type, which in principle exist, can be made available to researchers any time soon.

This implies that it is crucial to make the best possible use of the current data, and that is why the choice of predictive model is highly relevant. Short-term predictions are the building blocks of long-term predictions, which carry enormous policy implications, so even a small improvement in short-term accuracy matters. Put in this context, an average improvement in sensitivity of 12 percentage points, which translates into a 30 % relative improvement, is enormous.

We do not claim to have produced the best possible classifier, and it is likely that better methods can be devised, especially ones that take advantage of prior information about the development of chronic conditions. However, the main lesson learned is that the choice of predictive model can make a big difference. This seems particularly important because in the area of health analytics we have not seen a high rate of adoption of methods such as SVMs or Deep Learning, which have proved extremely successful in a wide range of applications. We therefore hope that this study will be a first step toward a broader use of methods that carry the potential of leading to large improvements over the status quo.