Introduction

Background

Several countries worldwide have implemented risk equalization (RE) in their (competitive) health insurance schemes. RE is a system of prospective risk-adjusted payments that compensates health insurers or health plans for predictable differences in individuals’ health care expenses. The principal goals of RE are (1) to achieve affordability of health insurance for high-risk individuals and (2) to mitigate financial incentives for insurers to engage in risk selection [51]. The latter is particularly relevant for competitive health insurance schemes with premium regulation, as found in Belgium, Germany, Israel, the Netherlands, and Switzerland.

Schokkaert and van de Voorde [36–38] have argued that the calculation of risk-adjusted payments involves two steps. The first step focuses purely on estimating the prediction model, with the aim of explaining variation in individual health care expenses and obtaining predictions that are as accurate as possible. Schokkaert and van de Voorde propose including all relevant risk factors in the model, regardless of whether the regulator desires compensation for them, in order to avoid (omitted-variables) bias in the predictions of individual expenses [36–38]. In the second step, the estimated model is used to calculate the risk-adjusted payments. This step involves normative choices by the regulator on the appropriate incentives for efficiency and risk selection and on the risk factors for which insurers should be compensated. If a regulator does not desire compensation for a risk factor, the effects of this factor can be neutralized in the calculation of the risk-adjusted payments, e.g., by replacing it with its population average or any other value identical for all individuals [36–38]. These normative choices on the appropriateness of incentives and on the risk factors for which insurers should and should not be compensated may be made differently in different countries. The empirical analysis of our study focuses purely on the first step, i.e., on the estimation of the prediction model.

Over the past two decades, the predictive performance of the models used in RE has improved substantially as a result of the development of diagnostic-based and pharmacy-based risk adjusters [1, 12, 13, 16, 17, 19, 22–24, 32, 33, 35]. Over the past 5 years, the RE literature has paid increasing attention to indicators of health status based on prior utilization or costs [e.g., 46, 47] and to risk adjusters based on self-reported health or chronic conditions [e.g., 9, 13, 43]. Examples of diagnostic-based and pharmacy-based models are those used in Belgium, Germany, the Netherlands, and the US (Medicare). Several studies, however, have shown that even these sophisticated models do not adequately predict individual expenses, especially for high-risk individuals [4, 6, 48, 49]. Consequently, insurers receive risk-adjusted payments that are predictably too low for high-risk individuals and too high for low-risk individuals, which confronts them with incentives for risk rating and/or risk selection. Both jeopardize the affordability of coverage, access to health care, and the quality of care [51, 52]. For example, insurers can select risks by offering less attractive benefits, not contracting high-quality care, or providing poor services to high-risk subgroups [30, 51]. To mitigate these incentives and to stimulate efficiency, further improvement of the prediction models currently used in RE is important.

Study objective and its contribution

This study endeavors to improve the prediction models used in RE by extending them with risk adjusters based on administrative information on costs and diagnoses from multiple prior years. Most currently used models use administrative data from one year to predict expenses in the next year. In 2012, the Dutch model was extended with a risk adjuster for ‘multiple-year high costs’ [47, 49]. The Dutch model also includes risk adjusters based on diagnoses from the previous year’s hospitalizations, the so-called diagnostic cost groups (DCGs), and on the previous year’s use of prescribed drugs, the so-called pharmaceutical cost groups (PCGs). Studies have shown that adding risk adjusters based on cost and diagnostic information from multiple prior years may lead to more accurate predictions for individuals with systematically high expenses, such as the chronically ill [20, 21, 42, 47, 49]. Since most currently used models use ‘only’ information from one prior year, and the 2012 Dutch model additionally uses ‘only’ information on total prior costs (and not diagnoses from multiple prior years), we expect that including additional risk adjusters based on such multiyear information could further improve models’ predictive performance.

The present study makes two important contributions to the RE literature. First, it develops two models: one that uses diagnostic information from multiple prior years and another that additionally uses cost information from multiple prior years. Comparing the predictive performance of these models with that of several (proxies for) currently used models will indicate the extent to which those models could be improved by using administrative information on diagnoses and costs from multiple prior years. Second, assessing the predictive performance of the two newly developed models will indicate to what extent they adjust payments for differences in individuals’ expenses and thus whether they would adequately compensate insurers.

This study uses an innovative approach. We used a very large administrative dataset covering almost the entire Dutch population (13.8 million observations), with many potentially relevant variables over multiple years. Using this dataset, we constructed a large array of multiyear cost-based and diagnostic-based risk adjusters, which we used to develop two models. To specify the model using both cost-based and diagnostic-based adjusters, we applied several variable-selection methods to select the variables that contribute statistically significantly to the model’s predictive power. All models estimated in this study are evaluated on an external dataset containing health survey information.

Our empirical analysis is limited to estimating prediction models used in RE and assessing their predictive performance. It does not address the normative choices involved in calculating the risk-adjusted payments in practice, nor other qualitative criteria used for deciding on the design of the model used in practice, such as feasibility in terms of necessary data, redistributional effects, or vulnerability to manipulation [51]. This implies that we estimate several prediction models and examine the fit between predicted and observed expenses. The closer predicted expenses are to observed expenses, the better the model adjusts for differences in individuals’ observed expenses. It should be noted, however, that in practice a model with a better fit between predicted and observed expenses may not always be preferred over a model with a poorer fit, because the payments to insurers or health plans do not have to (and cannot) adjust for all variation in individuals’ observed expenses. A considerable amount of variation in observed expenses is due to acute events (i.e., random variation), which is unpredictable and for which insurers or health plans should not be compensated. In addition, there is variation due to risk factors for which the regulator desires compensation, the so-called compensation-type (C-type) risk factors (e.g., age, gender, need for health care related to health status), and due to risk factors for which compensation may not be desired, the so-called responsibility-type (R-type) risk factors (e.g., practice variation, inefficiency in the provision of care, or moral hazard).
Using information on costs and diagnoses from multiple prior years has often been debated in the RE literature, and it has been applied only in a (very) limited way in practice for calculating risk-adjusted payments, because risk adjusters based on prior costs and/or prior utilization may reduce incentives for efficiency [e.g., 20, 21, 53, 54]. Following the approach of Schokkaert and van de Voorde [36, 37], we need not be concerned with these normative choices about C- and R-type risk factors in our empirical analysis, because we focus purely on improving the prediction model. Based on the models developed in this study, the regulator could decide which risk factors in the model are C- or R-type factors and then neutralize the effects of the R-type risk factors to derive the risk-adjusted payments used in practice.

This study is relevant for regulators and policy-makers in countries with an RE scheme and for those who want to incorporate RE into their health insurance scheme. Although it uses administrative data from the Netherlands, regulators and policy-makers in other countries can learn from its findings, because several models similar to currently used RE models have been evaluated. For this reason, the results of this study and its policy and methodological implications may be relevant for (most) countries with RE or planning to implement it. This study aims to indicate areas in which currently used prediction models in RE could be further improved.

The remainder of this article is structured as follows. First, we describe the data and methods used in the empirical analysis; then we present the results. Finally, we conclude and discuss these results, highlighting the limitations of the study, formulating points for further research, and addressing health-policy implications for regulators in countries with an RE scheme and for those who are planning to implement RE in their health insurance schemes.

Data and methods

Administrative data and health survey data

Two datasets were used for the empirical analysis. The first dataset contained individual-level administrative data for the Dutch population for the period 2006–2009. The sample analyzed in this study consisted of individuals who were enrolled, for part or all of the year, in each of the 4 yearsFootnote 1 (N = 13.8 million). For those individuals, we had three types of information for each year: (1) demographic information, including age, gender, region, source of income, and socio-economic status; (2) diagnostic information, including DCGs and PCGs, based on prior hospitalization and prior use of prescribed drugs, respectively; and (3) cost information for several types of care, with total expenses defined as the sum over these types of care. The administrative dataset is used for predicting individual expenses. The dependent variable in each of the estimated models is annual total health care expenses in the year 2009, which we refer to as prediction year t. Total expenses in year t were annualized and weighted by the fraction of the year the individual was enrolled.Footnote 2 For example, an individual who died after 3 months in year t and had 100 Euro expenses was given a weight of 0.25 and 400 Euro annual expenses. By applying this method, mean predicted expenses in year t equal mean observed expenses in year t. Table 1 shows some descriptive statistics. Mean total expenses in year t, t−1, t−2, and t−3 were 1,689 Euro, 1,639 Euro, 1,495 Euro, and 1,383 Euro, respectively. In the study population in year t, the average age was 41.5 years, 2.8 % of the individuals were classified into a DCG and 17.7 % into a PCG, with 3.5 % having more than one PCG. In the Netherlands, individuals can be classified into only one DCG per year (the one with the highest follow-up costs), whereas individuals can be classified into more than one PCG per year.
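The annualization and weighting rule above can be illustrated with a small sketch. The helper function name is hypothetical; only the weighting rule itself is taken from the text.

```python
def annualize(raw_expenses, enrolled_fraction):
    """Return (weight, annualized_expenses) for one individual.

    An individual enrolled for a fraction f of year t with raw expenses c
    receives weight f and annualized expenses c / f, so that weighted mean
    predicted expenses equal weighted mean observed expenses.
    """
    weight = enrolled_fraction
    annualized = raw_expenses / enrolled_fraction
    return weight, annualized

# The example from the text: an individual who died after 3 months
# (enrolled fraction 0.25) with 100 Euro of observed expenses.
weight, annual = annualize(100.0, 0.25)
print(weight, annual)  # 0.25 400.0
```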

Table 1 Mean of total observed expenses and some risk characteristics in year t and prior years, in the administrative data from the Dutch population of insured over a 4-year period (N = 13.8 million). DCG Diagnostic cost groups, PCG pharmaceutical cost groups

The second dataset contained information on self-reported health from year t−1 and was derived from a Dutch household survey, the Permanent Survey of Living Conditions. This survey is conducted each year by Statistics Netherlands on a representative sample of the Dutch population.Footnote 3 It contains detailed individual-level information on health status, household, and environment. We merged the administrative dataset with the survey data at the individual level using an anonymous, unique identification variable (N = 7,979).Footnote 4 The health status information was used to define subgroups in the population for assessing predictive performance at the subgroup level. Given the administrative data and the health survey data, the following four-step procedure was applied to examine the additional value, in terms of predictive performance, of using cost and diagnostic information from multiple prior years to predict expenses.

Model estimation

Model 1–4: proxies for currently used models

As a first step, four models were estimated to serve as benchmarks for the two newly developed models. All independent variables in these models are dummy variables defining different risk classes in the population. Model 1 includes an intercept only, representing the situation where payments are not risk-adjusted but simply equal mean expenses in year t. Model 2 includes variables for age interacted with gender (number of variables = M = 39). This demographic model can be considered one of the simplest models used in practice. Model 3 includes the same risk adjusters as the Dutch model of 2011: age interacted with gender, region, source of income interacted with age, socio-economic status interacted with age, and DCGs and PCGs based on utilization in year t−1 (M = 113). “Appendix 1” describes the specification of these variables; a more detailed description can be found elsewhere [46]. Model 4 includes the same risk adjusters as the Dutch model of 2012, i.e., model 3 plus a risk adjuster for ‘multiple-year high costs’ defined over the 3 prior years (M = 119). Table 2 describes the independent variables in each of the estimated models. It should be noted that the variables in these four models resulted from choices by the Dutch regulator on the C- and R-type risk factors, which does not hold for the two newly developed models.

Table 2 Description of the independent variables for each estimated model. “Appendix 1” gives a more detailed description of the variables in models 1–4

Model 5: additional diagnostic information from three prior years

As a second step, we developed a model using diagnostic information from three prior years (Model 5). This model includes the same risk adjusters as model 3, extended with the DCGs and PCGs from year t−2 and t−3 (M = 179). The reference group in the model for the DCGs and PCGs in a certain year was the group of individuals without a DCG or a PCG, respectively, in that year.

Model 6: additional cost and diagnostic information from three prior years

As a third step, we developed a model using cost and diagnostic information from three prior years (Model 6). Using the administrative dataset, we defined 903 independent variables. We started with the same sets of variables as used in model 5, i.e., the variables included in model 3 (M = 113) plus the sets of dummy variables for DCGs and PCGs from years t−2 and t−3 (M = 66). This model was then extended with two sets of variables for prior costs. First, we defined dummy variables for percentiles of each type of expenses in years t−1, t−2, and t−3 (M = 694). We had information on the following types of expenses: hospital care, primary care, paramedical care, pharmaceuticals, durable medical equipment, transport in case of illness, dental care, obstetrical care, and maternity care. To define the percentiles, each type of expenses was divided into 20 risk classes, each representing 5 % of the population with positive expenses. The top 5 % of the distribution was further divided into five risk classes, each representing 1 % of the population with positive expenses. These risk classes are expected to have strong predictive power, because being in the top 5 % of expenses in one year increases the likelihood of having high expenses in the next year(s) [15, 27]. All individuals with zero expenses for a given type of expenses were classified into a separate risk class, which served as the reference group for the set of percentile dummies of that type. An individual was assigned to a risk class if the individual’s expenses were below or equal to the threshold value of the calculated percentile and higher than the threshold value of the previous percentile. Second, we added a set of continuous variables for each type of expenses in years t−1, t−2, and t−3 (M = 30). Both dummy variables for percentiles of expenses and continuous variables were defined, because it was not known a priori which variables would have (more) predictive power.Footnote 5
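The construction of the percentile risk classes can be sketched as follows. This is a minimal illustration: the function name and the use of NumPy's default percentile interpolation are our assumptions, not part of the original specification.

```python
import numpy as np

def percentile_classes(costs):
    """Assign each individual a risk class for one type of expenses.

    Class 0 is the reference group of individuals with zero expenses.
    Among individuals with positive expenses there are 19 classes of 5 %
    width up to the 95th percentile and 5 classes of 1 % width for the
    top 5 % of the distribution (24 positive classes in total).
    """
    # Thresholds: every 5th percentile up to the 95th, then the 96th-99th,
    # computed on the positive part of the distribution only (23 thresholds).
    pct = list(range(5, 100, 5)) + [96, 97, 98, 99]
    thresholds = np.percentile(costs[costs > 0], pct)
    classes = np.zeros(len(costs), dtype=int)
    positive = costs > 0
    # side='left' places an individual whose expenses exactly equal a
    # threshold in the lower class, matching the "below or equal to the
    # threshold" rule described in the text.
    classes[positive] = 1 + np.searchsorted(thresholds, costs[positive],
                                            side='left')
    return classes
```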

Stepwise regression methods were used to select only those variables with statistically significant predictive power; with 903 candidate variables, not all are likely to be relevant for predicting individual expenses. Stepwise regression methods are useful for selecting a subset of variables for purposes of prediction or exploratory data analysis [14, 31, 44]. They use a forward/backward selection procedure, which implies that variables can enter and leave the model at each step, starting with the variable that yields the largest contribution to the model in terms of the F-statistic. At each step, the variable with the most significant F-statistic is added and any variable in the model producing a non-significant F-statistic is dropped. The procedure stops when no variable outside the model can make a significant partial contribution and no variable in the model can be dropped without a significant loss in predictive power. We used a significance level of 0.05 to test the F-statistics.Footnote 6 Our analysis focused primarily on prediction, not on hypothesis testing or causal interpretation of the effects of the independent variables. If the purpose were to draw statistical inferences about these effects, (a high degree of) multicollinearity would be a concern, because correlation among variables may influence the order of variable selection [14, 31]. For purposes of prediction, however, multicollinearity is not of particular interest, because we are interested only in the predictive power of the model and not so much in which variables contribute (most) to it.
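A minimal sketch of such a forward/backward stepwise procedure is given below. The fixed entry/removal threshold of 3.84 (roughly the 0.05 critical value of F(1, n−p) for large n) and the function names are our assumptions for illustration; the study itself applied standard statistical software to far larger data.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit (design matrix X includes
    the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def stepwise_select(X, y, f_in=3.84, f_out=3.84, max_iter=50):
    """Forward/backward stepwise selection on the partial F-statistic.

    At each step the candidate with the largest significant partial F
    enters, and any included variable whose partial F has become
    non-significant is dropped; the procedure stops when neither step
    changes the model.
    """
    n, m = X.shape
    selected = []
    for _ in range(max_iter):
        changed = False
        base = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
        sse_base = sse(base, y)
        # Forward step: try each excluded variable, keep the best one.
        best_j, best_f = None, f_in
        for j in set(range(m)) - set(selected):
            sse_full = sse(np.column_stack([base, X[:, j]]), y)
            f = (sse_base - sse_full) / (sse_full / (n - base.shape[1] - 1))
            if f > best_f:
                best_j, best_f = j, f
        if best_j is not None:
            selected.append(best_j)
            changed = True
        # Backward step: drop variables that no longer contribute.
        for j in list(selected):
            rest = [k for k in selected if k != j]
            X_rest = np.column_stack([np.ones(n)] + [X[:, k] for k in rest])
            X_full = np.column_stack([X_rest, X[:, j]])
            sse_full = sse(X_full, y)
            f = (sse(X_rest, y) - sse_full) / (sse_full / (n - X_full.shape[1]))
            if f < f_out:
                selected.remove(j)
                changed = True
        if not changed:
            break
    return sorted(selected)
```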

A split-sample approach was applied in order to mitigate the influence of outlier observations and over-fitting of the data. The stepwise regression method selects the subset of variables that fits the data best, so there is a risk of over-fitting when the same sample is used both for estimating the model and for predicting expenses [3, 28]. Therefore, the total sample was split into a training sample and a validation sample. In the fourth step of the analysis, the administrative data were merged with the health survey data. To make maximum use of these data, we first assigned all respondents of the health survey to the validation sample; subsequently, all other individuals were assigned randomly to either the training or the validation sample, so that each sample contained approximately half of the total observations. This approach does not introduce selection bias, and therefore both samples can be considered representative of the Dutch population that was enrolled during the study period.Footnote 7 All six models examined in this study were estimated on the training sample, and the coefficients of the variables in these models were used to predict individual expenses in the validation sample (the model parameters of each estimated model are available on request from the first author).
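The assignment rule can be sketched as follows. The function name and seed are hypothetical; the actual mechanism is not specified beyond what the text states.

```python
import numpy as np

def split_sample(ids, survey_ids, seed=1):
    """Split individuals into a training and a validation sample.

    All health-survey respondents go to the validation sample; everyone
    else is assigned to the training sample with probability 0.5, so each
    sample ends up with roughly half of the observations when the survey
    group is small relative to the population.
    """
    rng = np.random.default_rng(seed)
    in_survey = np.isin(ids, survey_ids)
    to_training = ~in_survey & (rng.random(len(ids)) < 0.5)
    return ids[to_training], ids[~to_training]
```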

All six models were assumed to be linear in the coefficients and included an intercept. The use of ordinary least squares (OLS) models on untransformed data for predicting individual expenses has been widely discussed in the literature, because OLS may not fit the distributional properties of health care expenses very well [5, 7, 10, 25, 26, 45]. We nevertheless used an OLS model on untransformed data for three reasons. First, OLS models are easier to use and interpret than alternatives such as two-part models (2PMs), generalized linear models (GLMs), or models based on (log-)transformed data. In the context of RE, this feature is highly important for regulators and policy-makers, and OLS on untransformed data has therefore been adopted widely in practice. Second, this study aims to examine the potential for improving currently used prediction models; for a consistent comparison, we should estimate the models with the same estimation method as used in practice. Third, the analysis is based on a very large sample, and several studies have shown that when sample sizes are large (enough), OLS may provide the same model fit as more complicated models such as 2PMs or GLMs [11, 18, 29, 34, 54]. We therefore expect that other estimation methods would have yielded quite similar results.

Model evaluation

As a fourth step, the predictive performance of the estimated models was assessed and compared at both the population and the subgroup level. This makes it possible to examine how well the models predict expenses for the total sample and for specific subgroups in the population of insured. At the population level, the adjusted R-squared (R²) and the mean absolute prediction error (MAPE) were calculated for each model. The MAPE was calculated as the average of the absolute differences between predicted and observed expenses. Higher R²-values and lower MAPE-values indicate better predictive performance, since predicted expenses are closer to observed expenses.
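For concreteness, the two population-level measures can be written as a short sketch (function names are ours; note that the MAPE here is the mean absolute error in Euro, not a percentage error):

```python
import numpy as np

def adjusted_r2(y, y_hat, n_predictors):
    """Adjusted R-squared for a model with n_predictors independent
    variables plus an intercept."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

def mape(y, y_hat):
    """Mean absolute prediction error: average of |predicted - observed|."""
    return float(np.mean(np.abs(y_hat - y)))
```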

Models’ predictive performance at the subgroup level was assessed by the mean prediction error (MPE). The MPE was calculated as the average of the differences between predicted and observed expenses, i.e., the average under- or over-prediction per individual in a subgroup. A model tends to perform better on subgroups defined by information from the training sample than on subgroups defined by information from the validation sample, and on subgroups matching (or highly correlated with) the risk cells of the model [8]. To perform a stronger test, we evaluated models’ predictive performance on subgroups using an external dataset: the health survey sample merged with the validation sample (N = 7,979). The MPE on survey subgroups provides a good indication of the extent to which models compensate insurers for differences in expenses between subgroups. This method has also been applied in other studies [41, 42, 48, 49].
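The subgroup-level measure is simply the signed average of the prediction errors; a minimal sketch, where the boolean mask selecting a subgroup is our notation:

```python
import numpy as np

def mpe(y_obs, y_pred, mask):
    """Mean prediction error for the subgroup selected by `mask`:
    average of (predicted - observed). A negative value means the model
    under-predicts expenses for that subgroup."""
    return float(np.mean(y_pred[mask] - y_obs[mask]))
```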

General demographic risk characteristics in the dataset used for the model evaluation at the subgroup level are comparable to those of the training and validation samples, providing evidence for the representativeness of the health survey respondents for the Dutch population (Table 3). There are, however, three exceptions: individuals younger than 24 years, individuals aged between 25 and 44 years, and individuals living at a home address with more than 15 persons. The first group is slightly overrepresented in the survey data, while the second and third are underrepresented. The main reason for the latter is that the health survey is targeted mainly at individuals living in private households; institutions, such as mental hospitals and nursing homes, are excluded from the sample selection. Therefore, our results may not be representative of the subgroup of institutionalized individuals.Footnote 8

Table 3 Descriptive statistics for individuals in the administrative data and the respondents of the health survey who matched successfully with the administrative data

Specifically, information on self-reported health status, (long-term) diseases and conditions, and health care utilization was used to construct 45 subgroups. These subgroups were defined in such a way that they include a relatively large proportion of high-risk individuals (e.g., the chronically ill), and they are comparable to those defined by van Kleef et al. [48, 49], Stam [40], and Stam and van de Ven [41]. The subgroups were identified by questions such as: “How do you rate your health status?”, “Do you have one of the following diseases?”, and “Do you have problems performing a certain daily activity?”. Most subgroups were defined by ‘yes/no’ questions. “Appendix 2” describes the definition of subgroups based on more than one question and/or more answer categories.

A (two-sided) t-test was applied to test whether the MPEs on subgroups differ statistically significantly from zero. For this test to be meaningful, the overall MPE for each model in the survey sample has to equal zero. This was, however, not the case; e.g., Table 3 shows that mean total observed expenses differ from mean total predicted expenses of model 1 in the survey sample. Therefore, the MPEs for each model in the survey sample were corrected as follows: individual observed expenses were multiplied by a factor equaling average predicted expenses in the survey sample divided by average observed expenses in the survey sample. These corrected MPEs were used to assess models’ predictive performance on subgroups and to test their statistical significance.
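The correction and the subsequent test can be sketched as follows. The function name and the large-sample normal approximation (|t| > 1.96) are our assumptions for illustration; the study reports proper two-sided t-tests.

```python
import numpy as np

def corrected_mpe(y_obs, y_pred, mask):
    """MPE on a subgroup after rescaling observed expenses so that the
    overall MPE in the (survey) sample is zero: observed expenses are
    multiplied by mean predicted / mean observed."""
    y_obs_c = y_obs * (np.mean(y_pred) / np.mean(y_obs))
    diff = y_pred[mask] - y_obs_c[mask]
    m = float(np.mean(diff))
    # Two-sided test of H0: MPE = 0, using a large-sample normal
    # approximation to the t-distribution (illustration only).
    se = np.std(diff, ddof=1) / np.sqrt(len(diff))
    return m, bool(abs(m / se) > 1.96)
```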

Results

Predictive performance at the population level

The results in Table 4 show the predictive performance of the estimated models at the population level in terms of the adjusted R² and the MAPE. These results show that the predictive performance of a model increases as more risk adjusters are added. Model 2 (i.e., a demographic model) has an R²-value of 5.38 % and a MAPE of 1,808 Euro. As risk adjusters are added to model 2, i.e., socio-economic status interacted with age, source of income interacted with age, region, and DCGs and PCGs from one prior year, the R²-value increases to 23.96 % and the MAPE falls to 1,554 Euro. Adding the risk adjuster for ‘multiple-year high costs’ to model 3 further increases the R²-value to 28.54 % and reduces the MAPE to 1,475 Euro. Model 5 has an R²-value of 24.84 % and a MAPE of 1,537 Euro, meaning that it has lower predictive performance than model 4. From this we may conclude that if model 3 is the benchmark and the aim is to improve predictive performance, it may be more effective to include a risk adjuster based on cost information from multiple prior years than one based on diagnostic information from multiple prior years. When the model already uses a risk adjuster based on cost information from multiple prior years (model 4), its predictive performance could be further improved by approximately 8 percentage points in R²-value by using additional cost and diagnostic information from three prior years. For models 1, 2, and 3 the potential for improvement from using cost and diagnostic information from multiple prior years is even larger. Consistent with other studies [2, 20, 54], these results confirm the predictive power of cost and diagnostic information from multiple prior years.

Table 4 Adjusted R² and mean absolute prediction error (MAPE) of the estimated models

Sensitivity analysis: specification model 6

To test the robustness of model 6, we performed a sensitivity analysis by changing the specification of the variable-selection procedure used for estimating this model. We estimated five alternative models. First, we re-estimated model 6 with two variable-selection procedures other than stepwise regression, namely backward elimination (alternative model 1) and forward selection (alternative model 2) [14, 44]. Second, we re-estimated model 6 with a significance level of 0.01 instead of 0.05, to examine whether the choice of significance level for entry and deletion of variables influenced predictive performance (alternative model 3). Third, we re-estimated model 6 with the risk adjusters of model 3 as a starting point, to which the stepwise regression method could add and delete variables based on cost and diagnostic information from three prior years; i.e., the risk adjusters of model 3 could not be deleted from the model. With this specification we examined whether it matters, in terms of predictive performance, if risk adjusters as used in practice are already included in the model. This procedure was applied twice, once with a significance level of 0.05 (alternative model 4) and once with a level of 0.01 (alternative model 5). The predictive performance of these five alternative models appeared to be similar to that of model 6 in terms of R²-values and MAPE-values: the R²-values of the alternative models ranged from 35.976 to 35.978 %, with the R²-value of model 6 being 35.976 %, and their MAPE-values ranged from 1,348.87 Euro to 1,349.06 Euro, with the MAPE-value of model 6 being 1,348.96 Euro. These results indicate the robustness of the specification of model 6 as applied here for predicting individual expenses.

Predictive performance at the subgroup level

Analyzing the MPE-values of all models for the 45 subgroups yields three types of results. For 14 subgroups, model 6 reduced the MPE to such an extent that it is no longer statistically significantly different from zero, while all other models produced statistically significant MPEs; i.e., adding cost and diagnostic information from three prior years statistically significantly improved models’ predictive performance (Table 5). For 7 subgroups, all estimated models produced statistically significant MPEs, implying that adding risk adjusters based on cost and diagnostic information from three prior years is not sufficient to adequately predict expenses for these subgroups (Table 6). Finally, for 24 subgroups the MPE was already not statistically significantly different from zero for one of the proxies for currently used models (models 1, 2, 3, or 4), implying that adding cost and diagnostic information from multiple prior years cannot further improve predictive performance statistically significantly (“Appendix 3”). In the remainder of this section, we focus purely on the first two types of results, i.e., on Tables 5 and 6.

Table 5 Subgroups for which the mean prediction error in year t is not statistically significantly different from zero for model 6. In this study, the prediction year t is 2009. The column of total expenses presents the corrected total expenses. Total and predicted expenses in the sample with health survey information were corrected in such a way that the average MPE on the total survey sample is zero; this was done to test whether the MPEs differ statistically significantly from zero. As a result, the column with total expenses in year t minus the column with the MPEs of model 1 yields the same number for each group, namely average total expenses in year t (1,689 Euro)
Table 6 Subgroups for which the mean prediction error in year t is statistically significantly different from zero for model 6. In this study, the prediction year t is 2009. The column of total expenses presents the corrected total expenses. Total and predicted expenses in the sample with health survey information were corrected in such a way that the average MPE on the total survey sample is zero; this was done to test whether the MPEs differ statistically significantly from zero. As a result, the column with total expenses in year t minus the column with the MPEs of model 1 yields the same number for each group, namely average total expenses in year t (1,689 Euro)

For all defined subgroups, expenses in year t are (far) above average expenses in the total sample in year t, indicating that, as expected, all subgroups contain a relatively high proportion of high-risk individuals. Further, for most subgroups the MPE is negative, which means that the models under-predict expenses for these subgroups. These under-predictions imply that expenses for the complementary subgroups (i.e., the low-risk individuals) are over-predicted. Conversely, a positive MPE-value implies that the model over-predicts expenses for that subgroup. When interpreting the results in Tables 5 and 6, it should be borne in mind that the same individual may occur in multiple subgroups.
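To make the sign convention concrete, the computation of a subgroup MPE and a test of its deviation from zero can be sketched as follows. This is an illustrative sketch on simulated data; it assumes the MPE is defined as predicted minus observed expenses (so a negative value signals under-prediction) and uses a simple two-sided one-sample t-test, which may differ from the exact test used in the study.

```python
import numpy as np
from scipy import stats

def subgroup_mpe(y_true, y_pred, mask):
    """Mean prediction error (predicted minus observed) for one subgroup.

    A negative MPE means the model under-predicts the subgroup's expenses;
    a positive MPE means it over-predicts them.
    """
    err = y_pred[mask] - y_true[mask]
    mpe = err.mean()
    # Two-sided t-test of H0: the subgroup's mean prediction error is zero.
    t_stat, p_value = stats.ttest_1samp(err, 0.0)
    return mpe, p_value

# Simulated data: annual expenses (in euros) for 1,000 individuals,
# with a model that systematically under-predicts high expenses.
rng = np.random.default_rng(0)
y_true = rng.gamma(shape=2.0, scale=800.0, size=1000)
y_pred = 0.9 * y_true + rng.normal(0.0, 100.0, size=1000)
mask = y_true > np.quantile(y_true, 0.8)  # a "high-risk" subgroup

mpe, p = subgroup_mpe(y_true, y_pred, mask)
```

With this construction, the subgroup's MPE is negative (under-prediction) and statistically significantly different from zero.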

The results in Table 5 show that models with more risk adjusters produce more accurate predictions at the subgroup level than models using fewer risk adjusters. For example, model 1 in Table 5 shows substantially negative MPE-values for all subgroups, all of them statistically significantly different from zero. Compared to model 1, models 2, 3, and 4 further reduce the MPE-values for all subgroups, but statistically significant MPE-values remain. As at the population level, model 5 has a lower predictive performance than model 4. If model 3 is used as a benchmark, adding diagnostic information from three prior years improves the predictive performance for all subgroups: e.g., for individuals with OECD limitations in moving (age ≥12 years), individuals with a low score on the SF-12 scales (age ≥12 years), individuals with limitations in daily activities (age ≥55 years), or individuals who reported two or more diseases (age ≥12 years). Model 4, however, further improves the performance for all subgroups in Table 5, owing to the inclusion of a risk adjuster for ‘multiple-year high costs’. Moreover, model 6 outperforms all other models on all subgroups in Table 5: the MPE-values for all subgroups in Table 5 are reduced to such an extent that they are no longer statistically significantly different from zero. These results suggest that cost information from multiple prior years may be more effective in improving models’ predictive performance than diagnostic information from multiple prior years, given the dataset used in this study and the use of model 3 as the benchmark. Based on our results, we conclude that using both cost and diagnostic information from multiple prior years may provide statistically significant improvements in models’ predictive performance for several subgroups in the population.

However, the results in Table 6 show that model 6 (i.e., the model using cost and diagnostic information in addition to the Dutch model of 2012) still under-predicts expenses for several subgroups. Under-predictions (statistically significantly different from zero) remain for individuals who reported a poor general health status (age ≥12 years), one or more long-term diseases (age ≥12 years), a myocardial infarction or other serious heart disease (age ≥12 years), psoriasis (age ≥12 years), a long-term disease or disorder other than migraine or other serious headaches, vascular constriction in stomach or legs, asthma or chronic bronchitis, chronic eczema, dizziness with falling down, or serious bowel disorders lasting longer than 3 months (age ≥12 years), three or more self-reported diseases or disorders (age ≥12 years), or use of complete dentures (age ≥16 years). Apparently, these subgroups are not accurately identified by the additional risk adjusters based on costs and diagnoses from hospitalizations and use of prescribed drugs in three prior years.

Discussion

Methodological limitations and points for further research

The empirical analysis and the data used to illustrate the potential for improving the predictive performance of models in RE using cost and diagnostic information from multiple prior years have certain drawbacks. First of all, even though a large dataset is used that is representative of the Dutch population, the dataset is restricted to a time period of three prior years. It is expected that cost and diagnostic information from more than three prior years could further improve models’ predictive performance [15, 21, 27]. It is relevant to investigate how many years of lagged cost and diagnostic information would still have statistically significant predictive power in the estimation year. Such research may provide useful insights into the persistence of under-predicted expenses for certain high-risk groups in the population and may suggest ways to further improve currently used prediction models in RE.

Second, our empirical analysis focused on improving models’ predictive performance by using cost and diagnostic information from multiple prior years. However, other information not available in our dataset may also be useful for further improving the models, such as outpatient diagnostic information [50]. Our analysis is restricted in this sense, and in practice there may be (many) more ways to further improve the prediction models. A relevant question is which types of information other than cost and diagnostic information from multiple prior years are available, and how this information could be used to further improve the prediction models.

Third, the predictive performance of the model may depend on the statistical method chosen to predict individuals’ expenses. We confined ourselves to the method used in practice, i.e., OLS, even though other statistical methods have been advocated in the literature [e.g., 5, 7, 10, 25, 26, 45]. To our knowledge, there is no empirical evidence comparing the predictive performance of transformed and/or nonlinear models with that of OLS models on untransformed data based on millions of observations. Further research could provide pertinent evidence by investigating whether models’ predictive performance can be further improved on large datasets (i.e., datasets with millions of observations) using methods other than those currently used in practice. Moreover, further research is needed to investigate whether the relationship between risk adjusters based on cost and diagnostic information from multiple prior years is additive or multiplicative; in this study, only additive relationships have been examined. Such research may result in further improvement of prediction models used in RE.
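As a minimal sketch of the estimation method used in practice, the following fits OLS to untransformed, simulated individual expenses with dummy risk adjusters. All variables here are hypothetical illustrations and do not correspond to the actual risk adjusters of the Dutch model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
age_group = rng.integers(0, 5, n)        # hypothetical age classes (0-4)
pcg = rng.integers(0, 2, n)              # hypothetical pharmacy-based group flag
prior_high_cost = rng.integers(0, 2, n)  # hypothetical multiple-year high-cost flag

# Design matrix: intercept plus dummies (one age class dropped as reference).
X = np.column_stack([
    np.ones(n),
    *(age_group == k for k in range(1, 5)),
    pcg,
    prior_high_cost,
]).astype(float)

# Simulated expenses with a right-skewed error, as is typical for health care costs.
beta_true = np.array([800.0, 100.0, 200.0, 400.0, 900.0, 1500.0, 2500.0])
y = X @ beta_true + rng.gamma(2.0, 400.0, n)

# OLS on untransformed expenses, as in currently used RE models.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
```

With an intercept included, OLS predictions match observed expenses on average at the population level; the subgroup-level MPEs discussed above are where residual mispredictions show up.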

Health-policy implications

As Schokkaert and van de Voorde [36, 38] have advocated, the calculation of risk-adjusted payments used in practice involves two steps. In the first step, the model is estimated with the aim of explaining variation in individual health care expenses and obtaining predictions that are as accurate as possible. The second step uses the estimated model to calculate risk-adjusted payments, which involves normative choices by the regulator on the appropriateness of incentives for risk selection and efficiency and on the risk factors for which insurers should and should not be compensated. The empirical analysis of this study was restricted to the estimation of the prediction model. Consequently, we cannot draw definitive conclusions as to the extent to which currently used RE models can be improved in practice. Our findings should be interpreted bearing the following points in mind.

First, the extent to which currently used RE models can be improved may depend on the degree to which the risk adjusters satisfy the criteria of fairness, appropriateness of incentives for efficiency and selection, and feasibility. In our empirical analysis, we did not consider the fairness criterion for the risk adjusters used in the two newly developed models: i.e., we did not distinguish between risk factors for which the regulator desires compensation (C-type risk factors) and risk factors for which the regulator does not desire compensation (R-type risk factors) [37]. According to the approach of Schokkaert and van de Voorde [36, 38], both C- and R-type risk factors should be included in the model in the first step of the calculation, rather than omitting the R-type risk factors, in order to avoid (omitted-variables) bias in the predictions. In the second step, the effects of the R-type risk factors can be neutralized, e.g., by using the average value of each such risk factor or the same value for all individuals in the population. Following this approach, regulators could use the models developed in this study by deciding which risk factors in the models are C- or R-type factors, neutralizing the effects of the R-type risk factors in the second step, and thus deriving the risk-adjusted payments used in practice. Note that the choice of C-type and R-type risk factors involves a value judgment by regulators, which may be made differently in different contexts by different regulators.
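The neutralization of R-type risk factors described above can be sketched numerically: after estimating the full model, each R-type column of the design matrix is replaced by its population mean before computing payments, so that factor no longer differentiates payments across individuals. The design matrix and coefficients below are hypothetical and purely illustrative.

```python
import numpy as np

def neutralized_payments(X, beta, r_type_cols):
    """Step-2 payments with R-type risk factors neutralized.

    Each R-type column is set to its population mean, i.e., the same
    value for every individual, so it no longer affects payment
    differences between individuals.
    """
    X_norm = X.astype(float).copy()
    for j in r_type_cols:
        X_norm[:, j] = X[:, j].mean()
    return X_norm @ beta

# Hypothetical example: column 0 intercept, column 1 a C-type factor
# (e.g., morbidity), column 2 an R-type factor the regulator does not
# wish to compensate.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 1, 1],
              [1, 0, 0]], dtype=float)
beta = np.array([1000.0, 1500.0, -400.0])

payments = neutralized_payments(X, beta, r_type_cols=[2])
```

A useful property of neutralizing with the population mean is that the sum of payments over the population equals the sum of the full-model predictions; individuals who differ only on the R-type factor receive identical payments.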

Note, however, that if regulators decide not to use cost and diagnostic information in the second step of the calculation of the risk-adjusted payments because using this information may reduce incentives for efficiency, incentives for risk selection may increase compared to a calculation that does use this information. This trade-off between reducing incentives for risk selection and maintaining incentives for efficiency is inevitable as long as there are no better alternatives than risk adjusters based on cost and diagnostic information from multiple prior years. If the regulator considers the incentives for risk selection too large relative to the reduction in incentives for efficiency, information on costs and/or diagnoses from multiple prior years can be used in the second step of the calculation of the risk-adjusted payments. In that case, restrictions could be placed on the risk adjusters based on prior costs and/or diagnoses in order to mitigate the reduction in incentives for efficiency. Examples are the thresholds on the ‘Defined Daily Dose’ for the PCGs and the requirement for the risk adjuster ‘multiple-year high costs’ that an individual is in the top 15 % in at least two of three consecutive years.
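As an illustration of such a restriction, the ‘multiple-year high costs’ flag (top 15 % of expenses in at least two of three consecutive years) can be computed as follows on simulated cost data. The per-year quantile cutoff used here is an assumption about how the threshold is operationalized, not a description of the official Dutch definition.

```python
import numpy as np

def multiple_year_high_cost(costs, top_share=0.15, min_years=2):
    """Flag individuals in the top `top_share` of expenses in at least
    `min_years` of the given years.

    `costs` has shape (n_individuals, n_years).
    """
    # Per-year cutoff separating the top `top_share` of spenders.
    thresholds = np.quantile(costs, 1.0 - top_share, axis=0)
    in_top = costs >= thresholds  # per-year top-15 % indicator
    return in_top.sum(axis=1) >= min_years

# Three simulated prior years of expenses for 1,000 individuals.
rng = np.random.default_rng(2)
costs = rng.gamma(2.0, 800.0, size=(1000, 3))
flag = multiple_year_high_cost(costs)
```

Requiring the top 15 % in at least two of three years flags far fewer individuals than a single-year threshold would, which is exactly how the restriction limits the efficiency loss from rewarding high past spending.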

An advantage of using cost and diagnostic information from multiple prior years is that this type of information is, in most situations, already available in the administrative files of (Dutch) insurers or health plans, so collecting it imposes no large additional administrative burden. In most situations, regulators and policy-makers could therefore relatively easily improve the predictive performance of currently used models by including cost and diagnostic information from multiple prior years.

Conclusions

This study has explored the potential for improving the prediction models used in RE in competitive health insurance schemes and makes two important contributions. First, it shows that the predictive performance of currently used models can be improved by extending these models with risk adjusters based on cost and diagnostic information from multiple prior years. Compared to the Dutch model of 2012, the predictive performance in terms of the R²-value could potentially be improved by 8 percentage points at the population level. At the subgroup level, models’ predictive performance could also potentially be improved: e.g., improvements can be expected for groups of individuals who reported OECD limitations in moving, a low score on one of the SF-12 health scales, limitations in daily activities, or two or more diseases or (chronic) conditions. The second contribution is the finding that even a model using additional cost and diagnostic information from multiple prior years does not adjust for all differences in individuals’ health care expenses: statistically significant under-predictions remain for certain high-risk subgroups in the population, e.g., groups of individuals with a poor general health status, with three or more diseases or (chronic) conditions, or who use complete dentures.

To conclude, currently used RE models do not adequately compensate insurers for predictable differences in individuals’ health care expenses, which confronts insurers with incentives for risk rating and risk selection, both of which jeopardize the affordability of coverage, the accessibility of health care, and the quality of care. This study shows that these incentives could potentially be (substantially) reduced by further improving the predictive performance of the model using cost and diagnostic information from multiple prior years, but that even this information does not remove them completely. The extent to which currently used RE models can be improved in practice to the level of the two models developed in this study may differ across countries, depending on the availability of data, the method chosen to calculate risk-adjusted payments, the regulator’s value judgment about the risk factors for which the model should and should not compensate insurers, and the trade-off between risk selection and efficiency.