Introduction

Estimating savings of 12 pilot projects on integrated care

In response to the fragmented delivery of health services, the Belgian government launched the Plan ‘Integrated Care for a better health’ in 2016. This program supports 12 regional pilot projects during a 5-year period (2018–2022) to initiate a development towards integrated care, i.e. a process of coordination and delivery of care services so that patients receive seamless and continuous care, tailored to their needs [1]. Participation was voluntary and the 12 pilots were selected after a nationwide call.

The conceptual model of the national plan is based on the ‘Chronic Care Model’ [2, 3] and focuses on effective and high-quality care for the chronically ill. It proposes that medical and non-medical care for patients with chronic illnesses be transformed from acute and reactive care to a proactive, planned approach. This should be achieved through effective care provided by a care team and the stimulation of self-management. The model further emphasizes intensive use of patient registries and supporting information technology. After wide consultation of Belgian stakeholders, the model was translated into a policy framework for the care of the chronically ill and a national plan with a series of concrete actions. As in other integrated care programs, the actions represent the transformation processes required to achieve a ‘Triple Aim’, i.e. the improvement of health outcomes and quality of care while being cost-effective or even cost-saving at the same time. The pilot projects serve as test regions to identify best practices that can be implemented nationwide in the future.

Each pilot consists of a local consortium that comprises primary, secondary and social care partners, hospitals, local authorities and health insurers operating in the region. Together they cover 21% of the Belgian population. During a preparatory period of 2 years (2016–2017), the selected pilots designed detailed action plans that address the local health needs and cover a wide range of components of integrated care, such as prevention and patient empowerment, care coordination and continuity, shared patient records and innovative ways of financing. The actions officially went into operation from July 2018 onwards.

Besides a subsidy for management,Footnote 1 the pilot projects would receive part of the savings they realize to reinvest in new actions. As in other shared savings programs [4, 5], savings are calculated by comparing average health care spending in the project to a benchmark that is based on predicted levels of spending. For reasons that will become clear later on, the actual payments of savings have been suspended, but the relevant estimations are still calculated as part of the evaluation of the projects. Changes in health care spending are evaluated at the level of the total population living in the pilot region. The calculations are intended to reflect short-term effects, i.e. the impact on spending conditional on time-varying individual characteristics such as health status and socio-demographic features. Prices are fixed at the national level, so savings on spending can only be realized by changes in the utilization of care or by substitution from higher- to lower-cost services, and not by lower prices. Most providers are reimbursed in a fee-for-service system. New ways of financing through bundled payment were planned by some projects but have not yet been implemented. Short-term savings may be achieved through, for example, eliminating medically unnecessary or harmful care (‘low value services’), avoiding duplication of services, or pooling scattered financial governance at the regional level (financial integration). Preventive interventions, follow-up care, or patient self-management may also lead to a reduction of avoidable care in the short term.Footnote 2 Because these arrangements require new investments, the impact of integrated care on total health spending at population level is unclear. Integrated care is expected to have also long-term cost effects, in particular through improved population health. Such effects, however, will be less observable in the 5-year evaluation period of the projects.

The Intermutualistic Agency (IMA), a collaborative data and research center set up by all the sickness funds, has been asked by the health authorities to estimate these savings. As in similar studies, the analysis is based on a difference-in-differences approach. Changes in mean spending in the project are compared with risk-adjusted mean changes in a control group, where the latter represent the expected outcome if the project had not been implemented (see, e.g., [6,7,8,9,10,11,12,13,14]). Using this approach, savings were estimated with large standard errors and were found to be very sensitive to high-cost patients. Therefore, they were considered unsuitable as a basis for actual payments to projects.Footnote 3 Part of the explanation of the large standard errors is the strongly skewed distribution of health care costs with a small fraction of the population accounting for a large share of total health spending and mean costs well above median cost. Variance is even more concentrated in the high-cost cases, and exacerbated by the presence of large outliers. As a result, even sophisticated models with large sets of control variables are faced with a portion of high costs not well captured by the model [15]. In this paper we compare different solutions to address this problem.

Treatment of high-cost patients

A common method to reduce model sensitivity to skewness is the logarithmic transformation of the dependent variable. However, the literature on the comparative performance of models for health care concludes that this transformation is less suitable for estimating individuals' health care costs (see [16] for an overview). Using Monte Carlo analyses with hypothetical cost data, or a quasi-Monte Carlo design with empirical data, these studies find that the log regression model generally performs poorly in terms of bias and predictive accuracy, while linear OLS is among the better performing models, in particular for very large sample sizes such as the one we use in our study. Moreover, the transformed data provide results on the log scale, while empirical applications such as ours typically require results that are expressed in terms of actual costs. New problems arise with the retransformation of costs to the original scale, in particular if there is heteroskedasticity in the data on the transformed scale [16, 17]. Other problems associated with the logarithmic transformation of the dependent variable are the treatment of observations with 0€ costs and the consistent estimation of the confidence intervals (see, e.g., [16, 18]). In line with these findings, we use a linear model estimated by Ordinary Least Squares. It is the commonly used method in recent evaluations of similar savings programs (see, e.g., [10,11,12,13,14]).

Another statistical approach to high-cost observations is to identify them as outliers and reweight them or remove them from the analysis. This indeed leads to more robust estimates but also to an underestimation of potential savings at the top end of the distribution. Integrated care initiatives focus largely on the chronically ill, who are often patients with high levels of spending. Moreover, short-term savings by integrated care are mainly expected from the reduction of unnecessary and avoidable care, which is likely to be concentrated among high-cost patients. Reweighting or removing high-cost patients entails the risk of excluding a significant portion of the costs affected by the initiatives and could thus lead to a systematic underestimation of realized savings.

A less extreme solution is the use of a top-coded dependent variable. Individual expenditures are then limited to a threshold: if a patient's costs exceed a certain threshold, the portion above the threshold does not count. Hence, the top-coded dependent variable is the minimum of actual spending and the threshold. Because of its flexibility and transparency, top-coding has often been used in shared savings models to define spending benchmarks [4, 5, 19]. The thresholds are usually based on the overall cost distribution and are an element of negotiation with the partners of the program. The thresholds can be chosen on the basis of insights in the specific care groups that lead to high spending and as a function of the aims of the program. The treatment of high-cost cases is hence not imposed solely on the basis of statistical considerations, but is motivated by an understanding of the actual scope of the program.
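The top-coding rule described above (the minimum of actual spending and a threshold) can be illustrated with a minimal sketch. The function name `top_code`, the choice of a percentile-based threshold, and the spending figures are assumptions made for illustration; they are not taken from the program's actual benchmark rules.

```python
import numpy as np

def top_code(spending, percentile=99.0):
    """Cap spending at the given percentile of its own distribution.

    Returns the top-coded values and the threshold that was applied.
    Hypothetical helper: the threshold rule is an illustrative assumption.
    """
    spending = np.asarray(spending, dtype=float)
    threshold = np.percentile(spending, percentile)
    # Top-coded value = min(actual spending, threshold)
    return np.minimum(spending, threshold), threshold

# Hypothetical annual spending in euros for ten individuals, one extreme case
costs = np.array([0, 120, 250, 400, 800, 1500, 2300, 5000, 12000, 250000])
capped, threshold = top_code(costs, percentile=90)

print("threshold:", threshold)
print("mean before:", costs.mean(), "mean after:", capped.mean())
```

All ten individuals stay in the sample; only the portion of the extreme case above the threshold is dropped, which is what distinguishes top-coding from removing the observation.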

Top-coding has the effect of lowering the impact of high-cost cases, while at the same time retaining these patients in the model. It is preferred to dropping high-cost observations altogether, because at least a portion of the high costs is often predictably associated with specific conditions captured by the covariates. Yet, while less drastic than simply removing observations, top-coding still lowers the incentives for cost control. This implies that we are facing a trade-off here: removing or top-coding high-cost cases may increase the robustness of the estimates, but at the same time reduce incentives for cost reduction.

We explore another method to weaken this trade-off. To increase reliability and precision of the estimates without excluding a meaningful group of patients from the analysis, we remove from the cost variable those health services that lead to high-cost cases but are unlikely to be affected by the program. The approach bears analogy to the modification of the dependent variable in certain empirical risk-adjustment models [15]. If the program cannot be held responsible for the growth of certain healthcare costs, they should not be part of the dependent variable. The identification of such cost will depend on the global context of the health insurance system and on the scope of the actual integrated care programs. In our case, we identified two distinct types of health services that entail high spending and are out of the scope of the projects’ interventions. Excluding these costs from the dependent variable should sharpen the incentives for cost reduction because the resulting model better reflects the costs that projects can affect.

From a theoretical point of view, increasing reliability and precision of the estimates by tailoring the cost variable to the scope of the program is preferable to removing or top-coding high-cost cases. The aim of this study is to statistically evaluate the different approaches. We compare their performance on measures of precision and model fit, and the stability of the coefficient estimates.

Method

We use a linear difference-in-differences (DiD) model to compare average changes in spending in the pilot project and the control group. The pre-intervention period (t = 0) includes observations in 2015, 2016 and 2017, and the postintervention year is 2018 (t = 1). Observations in the three pre-intervention years are given a weight of 1/3. The model is designed to capture short-term effects and to evaluate whether the pilot project has reduced spending conditional on time-varying health conditions and socio-demographic status.

For each project, we estimate the following linear regression:

$${y}_{it}={\beta }_{0}+{\beta }_{1}{P}_{it}+{\beta }_{2}{T}_{it}+ {\beta }_{3}\left({P}_{it}\times {T}_{it}\right)+{\sum }_{k}{\delta }_{k}{x}_{it}^{k}+{\sum }_{k}{\gamma }_{k}\left({x}_{it}^{k}\times {T}_{it}\right)+{\varepsilon }_{it} \tag{1}$$

where \({y}_{it}\) is spending of individual i in period t. \({P}_{it}\) is a dummy variable which equals one if i is in the pilot project at time \(t\) and zero otherwise. The coefficient \({\beta }_{1}\) is the pilot fixed effect. It captures group-specific effects, i.e. (non-random) differences between the pilot and the control group in the pre-intervention period. \({T}_{it}\) is a dummy variable for the postintervention year. \({\beta }_{2}\) captures time-specific effects, i.e. aggregate factors that cause changes in \({y}_{it}\) regardless of the group to which the individual belongs. The interaction term \(\left({P}_{it}\times {T}_{it}\right)\) is equal to one only for individuals in the pilot in the postintervention year. The coefficient of interest, \({\beta }_{3}\), is an estimate of the effect of the pilot project on changes in spending between the pre- and postintervention periods. We estimate the DiD model with OLS, the common method for the estimation of risk-adjusted health care costs, especially for large datasets (see also [10,11,12,13,14]). This setting allows us to use the maximum number of control observations, to flexibly include control variables and different time periods, and to obtain direct estimates of the effect of the projects on spending.
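To make the interpretation of \({\beta }_{3}\) concrete, consider a simplified version of Eq. (1) without covariates (a simplification assumed here purely for illustration). The four group-period means are then

$$\begin{aligned} E\left[{y}_{it}\mid {P}_{it}=0,{T}_{it}=0\right]&={\beta }_{0},\\ E\left[{y}_{it}\mid {P}_{it}=1,{T}_{it}=0\right]&={\beta }_{0}+{\beta }_{1},\\ E\left[{y}_{it}\mid {P}_{it}=0,{T}_{it}=1\right]&={\beta }_{0}+{\beta }_{2},\\ E\left[{y}_{it}\mid {P}_{it}=1,{T}_{it}=1\right]&={\beta }_{0}+{\beta }_{1}+{\beta }_{2}+{\beta }_{3},\end{aligned}$$

so that the change in the pilot minus the change in the control group is

$$\left[\left({\beta }_{0}+{\beta }_{1}+{\beta }_{2}+{\beta }_{3}\right)-\left({\beta }_{0}+{\beta }_{1}\right)\right]-\left[\left({\beta }_{0}+{\beta }_{2}\right)-{\beta }_{0}\right]={\beta }_{3}.$$

In the full model the same double difference holds for individuals with identical covariate values, because the terms in \({\delta }_{k}\) and \({\gamma }_{k}\) are common to the pilot and the control group and cancel out.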

The DiD model makes the identifying assumption that in the absence of treatment, the average outcomes for the treated and control groups would have followed parallel trends over time [20, 21]. This requires us to define a set of covariates, \({x}_{it}^{k}\) (k = 1, … n), that affect spending and are related to the intervention. With the inclusion of these covariates, the model should control for a sufficient proportion of variation in the data. We include a rich set of individual-level variables that risk-adjustment models have shown to be good predictors of total healthcare spending and which control, at the same time, for differences in the composition of the population of the pilot and control groups. The set of covariates is inspired by the Belgian risk-adjustment model for the financing of the health insurers [22]. In addition to interactions of age classes and sex, we include socio-demographic and socioeconomic variables, geographical variables, a dummy for mortality, a set of covariates indicating care dependence, and a large number of morbidity indicators. Care dependence and morbidity indicators capture chronic conditions. Care dependence variables are based on insurance entitlements related to disability, long-term nursing care, physiotherapy and revalidation, and morbidity indicators are pseudo-diagnoses based on ambulatory and inpatient drug prescriptions, some in combination with use of care and facilities for specific conditions. The full list of variables is provided in Appendix 1. Finally, we include municipality-level fixed effects to control for urbanization and medical supply in the area.

As a validity check of the main model assumption, we compare changes in spending in the pilot and risk-adjusted control groups in the three pre-intervention years. We also include interactions between the covariates and the time dummy, \(\left({x}_{it}^{k}\times {T}_{it}\right)\) and keep only the significant ones using backward elimination. The time-interactions allow the effect of a covariate to differ between the pre- and postintervention periods by an amount that is different from the aggregate time effect, \({\beta }_{2}\). The error term \({\varepsilon }_{it}\) represents random variation in spending.
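As a rough illustration of how a regression of the form of Eq. (1) could be estimated, the sketch below fits a weighted least-squares DiD model on synthetic data with a single covariate. The data-generating values, the variable names and the use of plain NumPy are all assumptions for illustration; the actual analysis uses a large set of covariates, municipality fixed effects and backward elimination of time interactions, none of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic individual-year observations (hypothetical data, not IMA data)
year = rng.choice([2015, 2016, 2017, 2018], size=n)
T = (year == 2018).astype(float)           # post-intervention dummy
P = rng.integers(0, 2, size=n).astype(float)  # pilot dummy
x = rng.normal(size=n)                     # one stand-in risk adjuster

true_beta3 = -20.0                         # assumed "true" effect in euros
y = (2000 + 50 * P + 100 * T + true_beta3 * P * T
     + 300 * x + 30 * x * T + rng.normal(scale=50, size=n))

# Design matrix: intercept, P, T, P x T, x, x x T
X = np.column_stack([np.ones(n), P, T, P * T, x, x * T])

# Pre-intervention observations get weight 1/3, post-intervention weight 1
w = np.where(T == 1, 1.0, 1.0 / 3.0)

# Weighted least squares: scale rows by sqrt(weight), then solve by OLS
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
beta3_hat = coef[3]

print("estimated beta_3:", round(beta3_hat, 2))
```

The coefficient on the interaction column recovers the assumed effect up to sampling noise. With real spending data the noise is far larger and strongly skewed, which is the problem the remainder of the paper addresses.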

Our baseline model has total (uncorrected) health care expenditures as the dependent variable and makes use of all the available observations. We then compare the results of removing high-cost cases and of excluding some specific cost categories from the dependent variable. Finally, we check the gains obtained by top-coding the dependent variable.

Data

We use claims data from 2015 through 2018 of all residents covered by the Belgian mandatory health insurance. The data are collected by the health insurers and compiled by the Intermutualistic Agency in a permanent database for research. It includes all reimbursed services as well as patient socio-demographic characteristics and social security related data. The pilot samples consist of all individuals living in the region of the pilot project, while the control sample is the same for all projects and includes all individuals not living in any of the pilot regions. Because we are interested in changes in mean spending across all regional levels of care, no patients are excluded, and individuals are assigned to a project or control group on a year-by-year basis.Footnote 4 The project-level estimates are thus based on an unbalanced panel of about 8.9 million observations per year.

The dependent variable is total spending reimbursed under the compulsory health insurance at the individual-year level. Reimbursements represent about 92.5% of total spending in the Belgian compulsory health insurance system, while out-of-pocket payments represent 7.5%.Footnote 5

Table 1 shows summary statistics of the study sample in the pre-intervention period (2015–2017). The number of individuals in a pilot project varies between 96,000 and 362,000, and is 193,000 on average. The control group is 45 times as large and includes more than 8.7 million individuals. Mean spending per project in the pre-intervention period was 2037€ on average across projects, and increased by 6% in 2018. Both measures differ considerably between projects, with some projects substantially below and others above the corresponding figures in the control group. The distribution of spending is strongly right-skewed. Mean spending is almost 4 times the median. Moreover, mean growth is considerably above median growth, suggesting that increases in spending are disproportionately situated in higher cost classes. The top percentile accounts for 19% of total spending and the top 5 percentiles account for 50% of total spending (see Table 3).
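Concentration measures of this kind (the share of total spending in the top percentiles and the ratio of mean to median) are straightforward to compute. The sketch below uses a hypothetical log-normal sample, so its output will not reproduce the figures in Table 1 or Table 3; the helper name `top_share` is likewise an assumption.

```python
import numpy as np

def top_share(spending, top_fraction):
    """Share of total spending accounted for by the top `top_fraction` of individuals."""
    ordered = np.sort(np.asarray(spending, dtype=float))[::-1]  # highest first
    k = max(1, int(round(top_fraction * ordered.size)))
    return ordered[:k].sum() / ordered.sum()

# Hypothetical right-skewed spending distribution (not the study data)
rng = np.random.default_rng(1)
costs = rng.lognormal(mean=6.5, sigma=1.5, size=100_000)

share_top1 = top_share(costs, 0.01)
share_top5 = top_share(costs, 0.05)
mean_to_median = costs.mean() / np.median(costs)

print(f"top 1% share: {share_top1:.2f}, top 5% share: {share_top5:.2f}")
print(f"mean / median: {mean_to_median:.2f}")
```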

Table 1 Summary statistics

Results

Baseline model

We evaluate the effect of the pilot projects on spending using a difference-in-differences model estimated with OLS, as described in the Method section. We first show baseline results without preprocessing the data, that is, no cost groups are excluded from the dependent variable. Figure 1 shows the coefficient estimates and 95% confidence intervals of \({\beta }_{3}\) for each pilot-level model. The results range between cost reductions of 30€ and increases in spending of 20€. Coefficient estimates are not significantly different from zero for ten projects and significantly negative for the two others. The lack of significant results may be due to the early stage of the projects or to the fact that the \({\beta }_{3}\) coefficients are estimated with relatively large standard errors of 12.5 on average across the 12 project-level estimations. The width of the confidence intervals ranges from 34€ to 64€, or 1.5 to 3.8% of mean spending.
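The link between the reported standard errors and the width of the confidence intervals can be checked with a short calculation, assuming the intervals are based on the usual normal approximation (an assumption; the computation is not spelled out in the text). For the average standard error of 12.5:

$$\text{width}=2\times 1.96\times \widehat{\mathrm{SE}}\left({\widehat{\beta }}_{3}\right)=2\times 1.96\times 12.5=49\text{€},$$

which lies within the reported range of 34€ to 64€. Relative to the average pre-intervention mean spending of 2037€, this corresponds to roughly \(49/2037\approx 2.4\%\), consistent with the reported range of 1.5 to 3.8% of mean spending.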

Fig. 1 The effect of the pilot projects on spending. Baseline model before data preprocessing. Coefficient estimates and 95% confidence intervals of \({\beta }_{3}\)

Excluding the high-cost cases

A traditional solution to increase precision is to exclude high-cost cases from the sample. Specifications 1.a and 1.b in Table 2 show the impact on the estimates when the top 0.1% and the top 1% high-cost cases are excluded, respectively. The table reports two fit statistics to evaluate model performance: the adjusted R-square as the fraction of the total variance in the dependent variable explained by the model, and the mean absolute error as the average of the absolute errors.Footnote 6 Because we are interested in precise and stable estimates, we also report the standard errors of the estimated coefficient, \({\beta }_{3}\), and the absolute parameter change compared to the baseline specification. Each cell provides the average calculated across the 12 project-level estimations. For the last measure, we also report the minimum and maximum value across the projects.
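For reference, the two fit statistics can be computed as in the sketch below. It uses the standard textbook definitions of adjusted R-square and mean absolute error, which may differ in detail from the definitions in the footnote; the function names and the small example are hypothetical.

```python
import numpy as np

def adjusted_r2(y, y_hat, n_regressors):
    """Adjusted R-squared; n_regressors excludes the intercept."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_regressors - 1)

def mean_absolute_error(y, y_hat):
    """Average of the absolute prediction errors."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat))

# Hypothetical observed and predicted spending in euros
observed = [100, 200, 300, 400, 1000]
predicted = [150, 150, 350, 350, 1000]

adj_r2 = adjusted_r2(observed, predicted, n_regressors=1)
mae = mean_absolute_error(observed, predicted)
print("adjusted R2:", adj_r2, "MAE:", mae)
```

The mean absolute error is expressed in euros and is less dominated by a few extreme residuals than the squared-error-based R-square, which is one reason to report both.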

Table 2 Model performance before data preprocessing

After excluding the top 0.1% and the top 1% high-cost cases, model fit measures strongly improve and standard errors are reduced on average by 32 and 52%. The last column, however, shows that the results are highly sensitive to these high-cost cases: parameter estimates substantially fluctuate across the specifications, at least for part of the projects.

Spending by care type among high-cost patients

To understand which cost groups cause unexplained variation at the top end of the distribution, we analyze the distribution of spending by type of care. The aim is to identify those cost groups that lead to high-cost patients, but which might be out of the scope of the projects. If the projects cannot be held responsible for the growth of certain costs, these costs should not be part of the dependent variable. Table 3 shows a summary of our findings. It reports the distribution of spending by care type in patients with lower costs (the 1st–94th percentile) and in the 5 highest percentiles. The latter account for 50% of total spending. Three groups of care services stand out in the spending by high-cost patients: Medication, Nursing home stays and, to a lesser extent, Nursing care and physiotherapy (services outside nursing homes). Together, these services represent 56–79% of spending in higher cost groups, but only 36% among low-cost patients.

Table 3 Distribution of spending by care type among low-cost (1st–94th percentiles) and high-cost patients (95th–99th percentiles)

Nursing home stays and nursing care are at the core of the transformation of health care towards integrated care. These services play a crucial role both in horizontal initiatives of the pilot projects (such as care coordination after hospitalizations, multidisciplinary care of the chronically ill, or reinforcement of primary care to avoid unnecessary complications), as well as in dedicated actions for dependent patients (such as prevention coaches or care trajectories to allow older people to live longer at home). Excluding these services would clearly ignore an important part of the effect of the pilots on spending.

The link between the use of ‘medication and medical devices’ and the integrated care actions is more ambiguous. On the one hand, several initiatives of the projects focus on medication compliance and avoidance of medication overuse. On the other hand, treatments for specific conditions, often with new and expensive medicines, clearly fall outside the scope of the projects in the short term. We identified two subsets that the integrated care projects are unlikely to affect. The first is ‘Medication outside the hospital lump sum’. It concerns a list of ATC codes for which the Belgian health authorities have determined that cost should not restrict their use. They are considered to be “medicinal products whose active substance is of major importance in medical practice, taking into account therapeutic and social needs and the innovative nature of the ingredient”. These medicines are kept outside the lump sum that patients pay for hospital drugs because the high cost could otherwise strongly inhibit their prescription.Footnote 7 Spending on these medicines has been increasing very sharply in recent years. Between 2016 and 2018, the increase was 20%, compared to a 6% growth of total spending. The pilot projects certainly cannot be held responsible for disparities in the growth of these costs; it would even run counter to official health care policy if the projects reduced the use of these medicines. Keeping these costs in the sample entails the risk that changes in spending are wrongly attributed to the effect of the program. The same reasoning applies to a second subset of medication that we labeled life-saving medical devices. It includes implants, implantable cardiac defibrillators, human body material, and dialysis. Even if the projects invest in prevention and avoidance of complications that could reduce the need for this material in the long term, they are unlikely to affect its use in the short term.

Both medication outside the hospital lump sum and life-saving medical devices represent a disproportionate share of spending among high-cost patients, i.e. 10–36%, compared with only 5% among low-cost patients. We exclude these cost subsets from the dependent variable, as they are problematic from both a statistical and a context-specific viewpoint.

Redefining the dependent variable

The results of redefining the cost variable are shown in Table 4. In specification 2, we do not exclude any patients but we remove the two subsets of medication discussed above, i.e. medication outside the hospital lump sum and life-saving medical devices, from the dependent variable. The model fit measures strongly improve compared to specification 1—baseline, and the standard errors are reduced by 42%. Remarkably, the improvements are of the same magnitude as in specification 1.b, where the top 1% patients are excluded (see Table 2). This indicates that the excluded cost groups are indeed responsible for a large part of the unexplained variation at the top end of the distribution.
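The redefinition of the dependent variable amounts to subtracting the excluded cost categories from each individual's total spending while keeping every individual in the sample. A minimal sketch follows, with hypothetical category labels and amounts (the actual claims data are organized differently):

```python
# Cost categories removed from the dependent variable (labels are hypothetical)
EXCLUDED = {"medication_outside_lump_sum", "life_saving_devices"}

def redefined_spending(spending_by_category, excluded=EXCLUDED):
    """Total spending after removing the excluded cost categories."""
    return sum(amount for category, amount in spending_by_category.items()
               if category not in excluded)

# Hypothetical high-cost patient: annual spending in euros by care type
patient = {
    "gp_visits": 300.0,
    "hospital_stays": 4200.0,
    "nursing_care": 1500.0,
    "medication_outside_lump_sum": 38000.0,
    "life_saving_devices": 0.0,
}

total = sum(patient.values())
adjusted = redefined_spending(patient)
print("total:", total, "redefined:", adjusted)
```

In this example the patient remains in the analysis with the spending that the projects can plausibly influence, whereas excluding the top 1% of cases would have removed the patient altogether.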

Table 4 Model performance after data preprocessing

Figure 2 illustrates the estimates and the standard errors for the \({\beta }_{3}\) coefficients of the 12 projects. The point estimates of the coefficients in specification 2 substantially differ from specification 1. Although standard errors are greatly reduced, they remain relatively large. The width of the confidence intervals still ranges from 19€ to 37€, or 0.9–2.6% of mean spending. We will now check whether top-coding can remedy this problem.

Fig. 2 Comparison of different specifications of the model. Specifications as in Tables 2 and 4. Coefficients and 95% confidence intervals of \({\beta }_{3}\)

The additional gains from top-coding

In specifications 2.a and 2.b (see Table 4), the dependent variable is top-coded at the value of the 99.9th and 99th percentiles of the cost distribution, respectively.Footnote 8 If a patient's cost exceeds this threshold, it is replaced by the threshold. The last two rows of Table 4 show that top-coding moderately improves model fit measures and further reduces standard errors by 8 and 11%, respectively. More importantly, we find that coefficient estimates remain fairly stable. They differ from specification 2 results by only 2€–4€ on average, with a maximum of 8€. Figure 2 provides a visual representation of this stability. Since top-coding weakens the incentives for cost reduction at the highest end of the distribution and yields only modest additional gains, one could consider not implementing it.

Validity check: the parallel trends assumption

As a validity check of the parallel trend assumption, we compare pre-intervention trends between the pilot and control groups. The graphs in Appendix 2 (Fig. 3) show mean spending in the pilot projects and risk-adjusted spending in the control groups from 2015 to 2018. Although the same control group is used for all projects, risk-adjusted mean spending differs for each pilot-level estimation, both in the level of spending and spending growth. The included risk-adjusters adequately control for differences in the composition of the populations of the pilot and control groups: the graphs show that pre-intervention trends in both groups are very similar. Moreover, the results for 2018 show no clear deviation from these parallel trends. Indeed, even if standard errors are strongly reduced after redefining the dependent variable, we still find that the estimated effects of the pilot projects on spending are not or only just significantly different from zero (see Fig. 2).

Discussion

In principle the exclusion from the dependent variable of expensive cost items that cannot be influenced by the integrated care projects does not only improve the statistical properties of the estimation exercise but is also justifiable on conceptual grounds, as it can sharpen the incentives to control costs. The identification of the costs to be excluded is not an easy task, however. Integrated care models are complex interventions and the mechanisms underlying the development of costs are difficult to trace. Excluded costs, e.g. high-cost medicines, will almost never be fully independent from the services of integrated care models. If the interdependence is substantial, the exclusion of expensive drugs could also reduce the incentives to control costs. The trade-off between statistical robustness and strong incentives will not fully disappear in this approach.

It is therefore obvious that the decision about which costs to exclude needs solid justification. For the reasons explained earlier, we believe that in our specific exercise we can safely assume that the indirect effects are not very large, especially in the short-term frame of our evaluation. Yet, our approach should be tested and further refined as more data on the results become available.

The results based on preprocessed data suggest that one project realized significant savings, while for the others, estimates are not significant or only just different from zero. It remains unclear whether these results are acceptable as a reliable basis for payments to projects. The lack of clear significant results may be due to the early stage of the program. Savings in similar programs have been found to increase over time [10,11,12, 14]. Moreover, because prices for care services in Belgium are fixed, savings on spending can only be realized by lower utilization of services and by substitution from higher- to lower-cost services, which is likely to take more time than savings by lower prices. The short time period is one of the main limitations of this study.

Consequently, one reason for the small effects we observe may be that, in a transition period towards integrated care, realized savings and increased costs may coexist, for example by strengthening primary care and multidisciplinary collaboration to ensure better patient follow-up or avoid duplication of services. An interesting way to move forward would be to investigate specific groups of care services that are mostly affected in either direction in this initial stage and to explicate their impact on the total cost development. DiD findings such as those shown in Fig. 3 offer a natural starting point for such a deeper analysis of the differences in the cost developments between the projects compared and the risk-adjusted control group. At this stage, our results are not sufficiently robust for a reliable analysis of this kind, but we plan to focus on the decomposition of costs in later work.

Finally, more effort should be invested in collecting information on quality of care and health outcomes. The quality and health data will be less skewed so that estimates of the program effects are likely to be more robust [10,11,12]. This will make it possible to analyze whether the integrated care projects have successfully achieved the intended Triple Aim of improving quality and health while reducing costs.

Conclusion

Calculations of the effect of care programs on spending that are used to reward organizations which are found to realize savings place high demands on the precision of the estimates. Standard errors that are relatively small in statistical terms often represent substantial budgets when expressed in monetary values. Imprecise estimates may therefore lead to undue payments to less-performing organizations and undercompensation of others. The calculation of savings and losses in the Belgian integrated care program is, as in shared savings programs, based on comparisons of mean spending growth in the program and a control group. These means are highly sensitive to high-cost patients. A purely statistical approach, which reweights high-cost cases or removes them from the analysis, may lead to a systematic underestimation of realized savings because short-term cost effects of integrated care are likely to be achieved in patients at the top end of the distribution. Our proposed approach combines statistical treatment of high-cost cases with careful preprocessing of the data, incorporating insights into the health care costs that integrated care actions in the specific context of the program are likely to affect. Care services that the program cannot be held responsible for are removed from the dependent variable. Removal of these services strongly improves model fit measures and leads to more precise estimates of savings and losses without excluding a meaningful group of patients from the analysis. We further check the sensitivity of the results to high-cost cases by re-estimating the model with a dependent variable that is top-coded at some level. Top-coding further increases the robustness of the estimated coefficients, but the additional gains are modest. Moreover, the results are fairly stable to top-coding. Since top-coding reduces the incentives for reducing costs and for focusing on the chronically ill patients with high expenditures, one may consider refraining from it.

We find that defining a dependent variable that is tailored to the context of the program yields goodness-of-fit measures that are similar to those obtained by simply excluding the top 1% cost cases from the sample. On the other hand, changes in spending are still estimated with relatively large standard errors. The width of the confidence intervals ranges between 0.9 and 2.6% of mean spending. These may be relatively precise results from an economic evaluation perspective, but they correspond to large amounts of money at the project level. Even if significant savings were estimated in subsequent years, random factors would lead to substantial differences in actual payments.