
Prediction of glucose changes in type 1 diabetes mellitus has received a considerable amount of scientific and commercial interest over the last decade. In large part, the driving force behind this surge in research can be explained by recent advances in sensor technology [101], and the attendant promises and hopes of closed, or semi-closed, loop control of diabetic glucose dynamics. Predictive models play a key role in many of these concepts—providing the essential simulation tool in MPC-oriented closed-loop arrangements of an artificial pancreas [20], or acting as a component in a decision support system that provides predictions directly to the user [82].

However, insulin-dependent diabetic glucose dynamics are known to be subject to time-varying dynamics. Considering this, as well as the vast number of models developed in the literature, it is unclear whether a single model can be determined to be optimal under every possible situation. This raises the question of whether it is more useful to rely on one model alone, or whether additional prediction accuracy can be gained by combining their outcomes. Accuracy may be gained from merging when mismodeling, or changing dynamics in the underlying data-generating process, makes a single model capturing the system behavior infeasible, e.g., due to practical identification concerns. Thus, an ensemble approach may improve both robustness and performance.

In this chapter, a novel merging approach—combining elements from both switching and averaging techniques, forming a ‘soft’ switcher in a Bayesian framework—is presented for the glucose prediction application.

1 Related Research

In this section, research related to glucose prediction and model merging is presented.

1.1 Models for Glucose Prediction

Models of glucose dynamics for predictive purposes can mainly be divided into two categories: physiologically oriented models and data-driven black-box approaches. The latter sometimes incorporate physiological sub models of insulin and glucose infusion following insulin administration and meal intake, but the main part of the dynamics stems from statistically derived relationships.

The development of physiological diabetic glucose modeling started with the simple linear models of [2, 12], aiming at describing the relationship between glucose and insulin utilization. Following these efforts, the slightly more complex, and well-established, minimal model [10] was suggested as a means to estimate insulin sensitivity from an intravenous glucose tolerance test (IVGTT). Detailed models of the glucose metabolism, separating insulin- and non-insulin-dependent glucose utilization and incorporating models of hepatic balance, renal clearance, glucose rate of appearance following meal intake, insulin pharmacokinetics, and in some cases pancreatic insulin synthesis and release, have since proliferated.

The transport of rapid-acting insulin from the subcutaneous injection site to the blood stream has been described in quite a few models of insulin pharmacokinetics. Most of these are linear compartment models, and reviews can be found in [72, 104]. This phenomenon has generally been considered independent of the metabolic interaction, and thus separated into a stand-alone model. In [104], 11 different models (10 compartment models and the model from [9]) were fitted to empirical meal test data from seven type 1 patients using rapid-acting bolus insulin. A third-order compartment model, with local degradation of insulin at the injection site (modeled as a Michaelis–Menten relationship), turned out to be the best choice according to the Akaike criterion [53], and may serve as a typical example of how insulin kinetics have been modeled.

The corresponding flux of glucose from the intestines following a meal intake has been modeled with different approaches. There is evidence that gastric emptying, to some extent, depends on the current glucose level, see, e.g., [94], but this relationship has not been incorporated in any model so far. Thus, the digestive process is also considered as a stand-alone model, without dependencies on the glucose metabolism. Two models have been widely used; those of [24, 62]. In [62], the model consists of a single compartment with a fixed, limited gastric emptying rate constant, and with a duration dependent on the meal size. Earlier work on models of glucose rate of appearance during an OGTT [22] and a mixed meal test [21] formed the basis for the model in [24]. Here, a third-order nonlinear compartment model was used, and also in this case the gastric emptying rate was limited depending on the amount of ingested carbohydrates.

Turning to general models of glucose metabolism, a sparse fourth-order linear model with physiological interpretation of the state variables and six tunable parameters was suggested in [92]. The original model was validated on data from intravenous experiments involving diabetic dogs. Thereafter, the model has been both reduced and extended to include exercise load and to also consider oral hyperglycaemic agents. The model order is still four, but the number of tunable parameters has been reduced to five, and the model has been incorporated into a decision support system (DSS) called KADIS [93].

In [62], a simulation model was presented based on the insulin kinetics from [9] and including hepatic balance (described by a look-up table), peripheral and insulin-independent glucose utilization (Michaelis–Menten-like relationships), renal clearance and the meal digestion model from the same paper (described above). Overall, the model contains only two tunable parameters; the rest are considered patient invariant. Later, the freely downloadable educational simulation software AIDA [61] was developed using this model. The system was validated on a set of 24 subjects, with parameter convergence achieved in 80 % of the cases [60].

Another simulation model that has been turned into an advisory system is the DIAS model [48]. Especially noteworthy in this model are the nonlinear model of the hepatic balance [6], fitted to tracer data from the literature, and the model extension to include the delayed hypoglycemic effect of alcohol intake [81]. The model was incorporated into a prototype eHealth tool called DiasNet [52], with a central server-based web service, which also communicates over the cellular network with the user's mobile application implemented on a smartphone. The system has been tested in a small field trial, but was mainly evaluated on overall data acquisition, transmission and application usability aspects, and not on results concerning model performance.

A large model with 19 tunable parameters was proposed in the Sorensen thesis [95], a model often used as a verification tool to assess different control approaches, e.g., [34]. The web-based educational simulation model GlucoSim [3] has been developed based on another thesis [84]. Generally, these models are difficult to fit to an individual person, and may lack structural identifiability. This makes them unsuitable for predictive purposes, but synthetic subjects may be created for simulation studies.

Currently, the most influential simulation model is the University of Virginia and Padova University (UVa/Padova) model described in [23, 24], which has been accepted by the U.S. Food and Drug Administration (FDA) as a substitute for animal trials in preclinical trials of closed-loop development [57]. For this purpose, 300 artificial subjects have been derived from parameters estimated in population studies, and used in, e.g., [59]. This model is based upon the classical minimal model [10] and the glucose rate-of-appearance model in [21]. The population data for estimating the 300 artificial subjects were derived using the triple-tracer protocol described in [8].

In [89], the minimal model was augmented with additional states to include the dynamical interaction between free fatty acids and the insulin and glucose compartments. The model parameters were partly fixed and partly identified using experimental data, and the model showed reasonable resemblance to data. In [90], the model was used, together with the gastric emptying function taken from [62], to fit data from one mixed meal consumed by normal subjects, with good correspondence.

The limitation of the classical minimal model in providing consistent estimates of insulin sensitivity when different insulin concentrations arise during an IVGTT was addressed in [83]. Modifications to the model were suggested to incorporate the saturation effect of insulin on insulin-dependent glucose utilization [69, 88], as well as a saturation effect on insulin transport from the plasma to the interstitial compartment. Generally, the saturation effect is not pronounced at the insulin infusion levels of most insulin-dependent diabetic patients. However, the critically ill often experience reduced insulin sensitivity, and are treated with intensive insulin therapy at abnormal insulin levels to maintain normoglycemia, thereby reducing mortality and morbidity [102]. Thus, for the purpose of improved glycemic control of the critically ill in Intensive Care Units (ICU), this model was picked up in [64]. Thereafter, the table-based protocol SPRINT, which acts as a decision support in the manual infusion control for the ICU personnel, was derived [18]. This approach has been successfully validated in a large study covering 371 subjects, achieving very tight glucose control [17].

Another extension of the minimal model was proposed in [28], incorporating the effects of physical exercise by adding parameters that increase insulin sensitivity, insulin-independent glucose utilization and insulin clearance during exercise. The model has not been evaluated empirically. The UVa/Padova model has also been extended to cover physical activity in [67], based on the model in [15]. The model links elevated heart rate to increased insulin sensitivity and insulin-independent glucose utilization. In [15], the model was fitted to data from a hyperinsulinemic clamp test, including a 15-min exercise period (50 % VO2max), for 21 type 1 subjects, with a weighted mean square estimation error of 7.7 mg/dl (it is unclear how the weights were chosen).

Yet another ambitious extension, with 19 parameters, of which 10 are subject to identification, and including modeling of the circadian rhythm, was given in [37]. In [38], the model was validated by simulation comparisons on two data sets of six and nine type 1 patients with excellent results (RMSE about 1 mmol/L), however apparently without cross-validation.

Before leaving the minimal model, the work in [54] deserves comment. Here, the minimal model, extended with a simple pharmacokinetic compartment model for the insulin kinetics and a compartment meal model of the same type as in [105], was tested on closed-loop data from a trial involving 10 type 1 subjects. Intraday variations of the model parameters related to insulin sensitivity, hepatic balance and insulin-independent glucose utilization were allowed over three different sections of the day. Also in this case, the model was validated without cross-validation, but with an impressive average simulation prediction error (RMSE about 16 mg/dl).

A simpler model, with only five tunable parameters, is the Hovorka model [50], later extended and altered for the critically ill in [51]. The former model has been validated for predictive capacity on 15 subjects with an RMSE of 3.6 mg/dl for a prediction horizon of 15 min. Parameter estimates were retrieved recursively from a sliding data window using a Bayesian approach. This model is also used extensively for MPC-oriented closed-loop validation in a simulation environment, including a cohort of 18 virtual patients [103]. Eight of the 18 parameter sets have been derived from experimental data, and the rest from so-called informed prior distributions. The model has also been used, e.g., in the evaluation of PID control in [39], which also makes use of the Sorensen model [95] and the minimal model [10].

Data-driven models have been investigated on CGM time series alone, or by considering inputs as well. The meal sub models of [24, 62] are furthermore often used as input-generating components in data-driven models, approximating the glucose flux from the gut following a meal intake. Here, the focus has been prediction for the purpose of early hypoglycemia detection, e.g., to be used for alarm triggering in CGM devices or temporary insulin pump shut-off, as well as establishing models suitable for model-based control.

Time-series analysis by autoregressive (AR) models started with [14], who evaluated the basic underlying assumptions concerning stationarity and auto-covariance that AR modeling is based upon, concluding that diabetic data generally are non-stationary, but highly auto-correlated, thus recommending that the models be recurrently re-estimated. Following this, AR and ARMA models were developed in [97, 99] using glucose data from a recently diagnosed type 1 diabetic. In [96], first-order recursive AR models were investigated for 28 subjects using a low-pass filtered CGM signal from the GlucoDay CGM system. The results indicate that hypoglycemia can be detected by the model 25 min before the CGM signal passes the same threshold. Another example of recursive AR and ARMA models of third order, incorporating a change detection feature for more rapid parameter re-estimation when large changes in the dynamics are detected, is found in [35]. The models were evaluated for 30 healthy, 7 glucose-intolerant and 25 type 2 diabetic subjects, with less than 4 % mean relative average deviation (RAD) and almost no values in the D or E zones of the Clarke Error Grid [19] for the 30-min predictions in comparison to the CGM Medtronic Gold reference [68]. Contrary to the above, the authors of [42] claim that a generic patient- and time-invariant AR model of order 30 can be identified from any patient and used for glucose prediction for any other patient. Very promising results were achieved in [41], where the model was evaluated for three different datasets, each utilizing a different CGM device, and the patient cohorts included both type 1 and type 2 diabetes. The prediction error was on average, in terms of RMSE, less than 3.6 mg/dl for a 30-min prediction, with negligible delay, and with 99 % of the paired prediction-reference points in the A and B zones of the p-CGA. However, these results were achieved by filtering the CGM signal in both training and test data using a non-causal filter, removing the high-frequency components. In [65], the causality aspect of the input filtering was addressed. The AR model, here reduced to order 8 after model complexity considerations, was reformulated as a linear model with a Kalman filter, and the filter parameters were adjusted to account for the filtering of the CGM signal. For evaluation purposes, the reference was, however, still filtered in the same non-causal way as before. Using this approach on the same data set as in [41] yielded more moderate results, with an average prediction error of 16 mg/dl and a 9-min lag for the 20-min prediction.

Algorithms specifically developed for hypoglycemia detection have also been proposed. In [76], a Kalman filter approach was suggested, estimating the states corresponding to the interstitial glucose level and its first and second derivatives, i.e., the rate of glucose change and the acceleration. In [75], this method was evaluated on 13 hypoglycemic clamp data sets. Using a hypoglycemic threshold of 70 mg/dl, the sensitivity and specificity were 90 and 79 %, respectively, with unknown alarm time. Combining three different methods for hypoglycemia detection with the ARMA model of [35], data from insulin-induced hypoglycemic tests for 54 type 1 subjects were evaluated in [33]. With a hypoglycemic threshold of 60 mg/dl, sensitivities of 89, 88, and 89 % and specificities of 67, 74, and 78 % were reported for the three methods, respectively. Mean times to detection were 30, 26, and 28 min.

A shortcoming of the AR models and the algorithms above is the lack of an input-output relationship, excluding them from being used in a model-based control framework. A natural extension to the AR concept is to include external inputs, transforming the model into an ARX model. This type of model has been considered in, e.g., [40], where both batch-wise and recursively identified patient-specific ARX models were analysed for nine patients, with a mean 30-min prediction error RMSE of 26 mg/dl. In [16], ARX, ARMAX and state-space models were investigated using different identification methods for 30-, 60-, 90- and 120-min prediction for nine Montpellier patients from a trial in the DIAdvisor project [30]. The best performance was achieved with the ARX and ARMAX models. The ARX model gave a standard deviation of the prediction error of 17, 34, 46 and 56 mg/dl on average for the 30-, 60-, 90- and 120-min predictions, respectively. The corresponding results for the ARMAX model were 16, 30, 39 and 44 mg/dl.

Another type of transfer function model, cast in the continuous-time domain, was proposed in [78], where it was evaluated for nine type 1 subjects on separate meal and insulin intakes. Model parameters were determined both heuristically and by least-squares estimation. The carbohydrate and insulin impacts of the model, i.e., the steady-state rise and drop of glucose following these intakes, were further compared to the corresponding estimates of these factors used in clinical practice. No independent prediction validation was given. This model was later evaluated in a control framework in [79], where two data sets were created by the Hovorka (4 subjects) and Padova (10 subjects) simulation models. Here, the model could approximate the simulated data very well, with a 3-h look-ahead prediction error of 26 mg/dl reported. A very similar model structure was used in [55], the difference being a time delay changed into a time lag. There, breakfast glucose excursion prediction was addressed for 10 patient datasets collected in the DIAdvisor project [30]. For each patient, model parameters were determined by constrained least squares for two breakfast meals and cross-validated on a third breakfast, with an average fit value of 42 %.

Neural network (NN) models were shown to be a competitive approach in [26], where a recurrent NN model was compared against an AR and an ARX model on a 30-patient dataset retrieved from the Padova simulation model. Here, the NN clearly outperformed the competing models, with an average RMSE of 4.9 mg/dl versus 29 mg/dl (AR) and 26 mg/dl (ARX) for the 45-min prediction. Apart from meal and insulin information, emotional factors, hypoglycemic/hyperglycemic symptoms and lifestyle/activities were collected in an electronic diary and used as inputs in the NN model of [77]. Training was performed on a dataset from 17 patients, and performance was evaluated on 10 patient data sets not included in the training set, with an RMSE of 44 mg/dl for the 45-min prediction.

A fully connected three-layer (5, 10 and 1 neurons per layer) NN, with sigmoidal transfer functions in the first two layers and a linear output block, was used in [80]. Neither insulin nor meal information was used; instead, the current and previous CGM values, up to 20 min back, acted as inputs. The model was evaluated on two datasets with different CGM devices (Abbott Freestyle and Medtronic Guardian). Three subject data sets were used for training for each patient group and were thereafter excluded from the validation data. For the six Guardian patients and the three Abbott Freestyle patients, the performance was 10, 18 and 27 mg/dl for the 15-, 30- and 45-min predictions, with delays of around 4, 9, and 14 min for upward trends, and 5, 15, and 26 min for downward trends. In [106], the linear predictor from [96] worked in a cascade-like configuration with a NN model, which used both CGM data and the glucose flux from the meal model of [24] as inputs. Training and validation were done using 15 patient records from the 7-day free-living conditions set of the DIAdvisor DAQ trial [30]. The NN was trained and validated on 25 time series, each of 3 days, selected so as to ensure a wide variety of glycemic dynamics. Nine daily profiles, containing several hypo- and hyperglycemic events, were used to test the NN, with an average error of 14 mg/dl and a 14-min delay for the 30-min prediction. For an assessment on 20 simulated subjects using the UVa/Padova model, the corresponding metrics were 9.4 mg/dl and 5 min. Both insulin and carbohydrate digestion were considered by incorporating input-generating sub models in the support vector machine of [45]. Additionally, exercise-induced glucose and insulin absorption variations were considered as inputs by processing a metabolic equivalent (MET) estimate, derived from a SenseWear body monitoring system (BodyMedia Inc.) used in the study, in a model by [91]. The model was trained individually for seven type 1 patients, with RMSEs of 9.5, 16, 25 and 36 mg/dl for the 15-, 30-, 60- and 120-min predictions.

Examples of other machine learning approaches that have been considered include, e.g., support vector regression [44] and random forests [46]. Both techniques were evaluated on the same dataset of 27 type 1 patient records from free-living conditions collected within the METABO project [43]. The recorded insulin injections as well as the meal intakes were fed into compartment models to provide estimated profiles of plasma insulin and glucose rate of appearance. Furthermore, physical activity, estimated from a body monitoring system, and the time of day were also added as input variables. The predictive performance of each method was assessed for 15-, 30-, 60- and 120-min prediction horizons with impressive results. The reported RMSE of the support vector regression for these prediction horizons was 5.2, 6.0, 7.1 and 7.6 mg/dl, whereas the random forest method performed slightly worse: 6.6, 8.2, 9.3 and 10.8 mg/dl.

Further reviews can be found in, e.g., [7, 45, 66].

1.2 Model Merging

Techniques for merging models for prediction purposes have been developed in different research communities. In the meteorological and econometric communities, regression-oriented ensemble prediction has been a vibrant research area since the late 1960s, see, e.g., [31, 85].

Also in the machine learning community, the question of how different predictors or classifiers can be combined for increased performance has been investigated, and different algorithms have been developed, such as the bagging, boosting [13] and weighted majority [63] algorithms, as well as online versions of these [56, 74].

In most approaches, the merged prediction \(\hat{y}_{k}^{e}\) at time k is formed as a linear weighted average of the individual predictions \({\hat{\mathbf{y}}}_{k}\):

$$\hat{y}_{k}^{e} = {\mathbf{w}}_{k}^{{\mathbf{T}}} {\hat{\mathbf{y}}}_{k}$$
(1)

It is also common to restrict the weights \(\mathbf{w}_k\) to [0, 1]. The possible reasons for this are several, of which the interpretation of the weights as probabilities, or rather Bayesian beliefs, is the dominant one. Such restrictions are, however, not always applicable, e.g., in the related optimal portfolio selection problem, where negative weights (short selling) can reduce the portfolio risk [32].
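As a minimal illustration of Eq. (1), the sketch below forms the merged prediction as a convex combination of the individual predictions, with the weights restricted to [0, 1] and summing to one; the function and variable names are illustrative and not taken from the original work.

```python
import numpy as np

def merge_predictions(y_hat, w):
    """Merged prediction per Eq. (1): a weighted average of the m individual
    predictions y_hat (shape (m,)) using weights w (shape (m,))."""
    w = np.asarray(w, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights are expected to lie in [0, 1] and sum to 1")
    return float(w @ np.asarray(y_hat, dtype=float))

# Example: three predictors forecasting glucose (mg/dl) at time k
y_hat_k = [142.0, 155.0, 150.0]
w_k = [0.2, 0.5, 0.3]
print(merge_predictions(y_hat_k, w_k))   # 150.9
```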

A special case, considering distinct switches between different linear system dynamics, has been studied mainly in the control community. The data stream and the underlying dynamic system are modelled by pure switching between different filters derived from these models, i.e., the weights \(\mathbf{w}_k\) can only take the values 1 or 0. A lot of attention has been given to reconstructing the switching sequence, see, e.g., [47, 73]. From a prediction viewpoint, the current dynamic mode is of primary interest, and it may suffice to reconstruct the dynamic mode for a limited section of the most recent time points in a receding-horizon fashion [4].

Combinations of adaptive filters have also stirred some interest in the signal processing community. Typically, filters with different update paces are merged, to benefit from each filter's specific responsiveness to change and steady-state behaviour, respectively [5].

Finally, in fuzzy modeling, soft switching between multiple models is offered using fuzzy membership rules in the Takagi–Sugeno systems [100].

Merging of predictions in the glucose prediction context has previously been investigated in terms of hypo- or hyperglycemic warning systems. In [25], the glucose prediction from a so-called output-corrected ARX predictor (see the reference for method details) was linearly combined with the prediction from an adaptive recurrent neural network model. The balancing factor for the linear combination was determined offline by optimizing a trade-off between hypo- and hyperglycemic sensitivity, effective prediction horizon and false alarm rate. This factor was determined individually for each patient, and the balance may differ between hypo- and hyperglycemia. A different mechanism was used in [27]. Here, five different predictors were run simultaneously, and the hypoglycemic alarm was based upon a voting scheme among the individual predictors: if a sufficient number of the five predictors exceeded the predefined hypoglycemic threshold value, an alarm was raised. Both studies indicated an improvement in alarm sensitivity compared to the individual predictors.

2 Problem Formulation

As seen from the review above, many different approaches to glucose modeling and prediction have been established. These methods may each be more suitable under specific conditions of the glucose dynamics, and improvements in robustness and prediction performance may be achieved by combining their outcomes, as indicated by the studies of hypo-/hyperglycemic alarm systems. Such a situation is depicted in Fig. 1, where two prediction models try to capture the true glucose level. In different situations, each predictor clearly outperforms the other and is capable of providing good estimates of the true glucose level. However, as the conditions change, its performance deteriorates, and instead the other predictor is more suitable to rely upon. Given this informal background, a more formal problem formulation is now outlined.

Fig. 1

Example of when merging between different predictors may be beneficial. Initially the model corresponding to the red dash-dotted prediction resembles the true reference (black solid curve) best, but as the conditions change the prediction given by the other prediction model (blue dashed curve) gradually takes the lead

A non-stationary data stream \(z_{k}: \{ y_{k}, u_{k} \}\) arrives with a fixed sample rate, set to 1 for notational convenience, at time \(t_{k} \in \left\{ 1, 2, \ldots \right\}\). The data stream contains a variable of primary interest called \(y_{k} \in {\mathbb{R}}\) and additional variables \(u_k\). The data stream can be divided into different periods \(T_{S_{i}}\) of similar dynamics \(S_{i} \in S = \left[ 1, \ldots, n \right]\), and where \(s_k \in S\) indicates the current dynamic mode at time \(t_k\). The system changes between these different modes according to some unknown dynamics.

Given m expert q-step-ahead predictions \(\hat{y}_{{\left. {k + q} \right|k}}^{j}, j \in \left\{ {1, \ldots ,m} \right\}\), of the variable of interest at time \(t_k\), each utilizing different methods and/or different training sets; how is an optimal q-step-ahead prediction \(\hat{y}_{{\left. {k + q} \right|k}}^{e}\) of the primary variable, using a predefined norm and under time-varying conditions, determined?

3 Sliding Window Bayesian Model Averaging

Apart from conceptual differences between the different approaches to ensemble prediction, the most important difference is how the weights are determined. Numerous methods exist, ranging from heuristic algorithms [5, 100] to theory-based approaches, e.g., [49]. Specifically, in the Bayesian Model Averaging framework [49], which will be adopted in this chapter, the weights are interpreted as partial beliefs in each predictor \(M_i\), and the merging is formulated as:

$$p\left( {\left. {y_{k + q} } \right|D_{k} } \right) = \sum\limits_{i} {p\left( {\left. {y_{k + q} } \right|M_{i} ,D_{k} } \right)p\left( {\left. {M_{i} } \right|D_{k} } \right)}$$
(2)

where \(p\left( {\left. {y_{k + q} } \right|D_{k} } \right)\) is the conditional probability of y at time \(t_{k+q}\), given the data \(D_{k}: \left\{ z_{1:k} \right\}\) received up until time k. If only point estimates are available, one can, e.g., use:

$$\hat{y}_{{\left. {k + q} \right|k}}^{e} = {\mathbb{E}}\left( {y\left. {_{k + q} } \right|D_{k} } \right)$$
(3)
$$\quad \quad \quad \quad \quad \quad = \sum\limits_{i} {{\mathbb{E}}\left( {\left. {M_{i} } \right|D_{k} } \right){\mathbb{E}}\left( {y\left. {_{k + q} } \right|M_{i} ,\;D_{k} } \right)}$$
(4)
$$= {\mathbf{w}}_{k}^{T} {\hat{\mathbf{y}}}_{k}$$
(5)
$${\mathbf{w}}_{k}^{\left( i \right)} = {\mathbb{E}}\left( {\left. {M_{i} } \right|D_{k} } \right)$$
(6)
$$p\left( {{\mathbf{w}}_{k}^{\left( i \right)} } \right) = p\left( {\left. {M_{i} } \right|D_{k} } \right)$$
(7)

where \(\hat{y}_{k + q}^{e}\) is the combined prediction of \(y_{k + q}\) using information available at time k, and \({\mathbf{w}}_{k}^{\left( i \right)}\) indicates position i in the weight vector. The conditional probability of predictor M i can be further expanded by introducing the latent variable \(\theta_{k} \in \varTheta = \left[ {1, \ldots ,p} \right].\)

$$p\left( {\left. {M_{i} } \right|D_{k} } \right) = \sum\limits_{j} {p\left( {\left. {M_{i} } \right|\theta_{k} = j,\;D_{k} } \right)p\left( {\theta_{k} = \left. j \right|D_{k} } \right)}$$
(8)

or in matrix notation

$${\mathbf{p}}\left( {{\mathbf{w}}_{k} } \right) = \left[ {{\mathbf{p}}\left( {\left. {{\mathbf{w}}_{k} } \right|\theta_{k} = 1} \right), \ldots ,{\mathbf{p}}\left( {\left. {{\mathbf{w}}_{k} } \right|\theta_{k} = p} \right)} \right]\;\left[ {p\left( {\left. {\theta_{k} = 1} \right|D_{k} } \right), \ldots ,p\left( {\left. {\theta_{k} = p} \right|D_{k} } \right)} \right]^{T}$$
(9)

Here, \(\varTheta\) represents a predictor mode in a similar sense to the dynamic mode S, and likewise \(\theta_{k}\) represents the prediction mode at time \(k\). \({\mathbf{p}}\left( {\left. {{\mathbf{w}}_{k} } \right|\theta_{k} = j} \right)\) is a column vector of the joint prior distribution of the conditional weights of each predictor model given the predictor mode \(\theta_{k} = j\). Generally, there is a one-to-one relationship between the predictor modes and the corresponding dynamic modes, i.e., p = n.

Data for estimating the distribution \({\mathbf{p}}\left( {\left. {{\mathbf{w}}_{k} } \right|\theta_{k} = i} \right)\) are obtained by constrained optimization on the training data. For labelled training data sets, the following applies:

$$\begin{aligned}\left\{ {\mathbf{w}}_{k|\theta_k = i} \right\}_{T_{S_i}} &= \arg\min \sum\limits_{m = k - N/2}^{k + N/2} \fancyscript{L}\left( y\left( t_m \right),\; {\mathbf{w}}_{k|\theta_k = i}^T \hat{\mathbf{y}}_m \right), \quad k \in T_{S_i} \\ &\quad {\text{s}}.{\text{t}}.\; \sum\limits_j {\mathbf{w}}_{k|\theta_k = i}^{\left( j \right)} = 1 \end{aligned}$$
(10)

where \(T_{{S_{i} }}\) represents the time points corresponding to dynamic mode S i , the tunable parameter N determines the size of the evaluation window and \(\fancyscript{L}\left( {y,\hat y} \right)\) is a cost function. From these data sets, the prior distributions can be estimated by the Parzen window method [11], giving mean \({\mathbf{w}}_{{\left. 0 \right|\theta_{k} = i}}\) and covariance matrix \({\mathbf{R}_{{\theta_k} = i}}\). An alternative to the Parzen approximation is of course to estimate a more parsimoniously parametrized probability density function (pdf) (e.g., Gaussian) for the extracted data points. For unlabelled training data, with time points T, the corresponding datasets \(\left\{ {\left. {{\mathbf{w}}_{k} } \right|\theta_{k} = i} \right\}_{T}\) are found by cluster analysis, e.g., using the k-means algorithm or a Gaussian Mixture Model (GMM) [11]. A conceptual visualisation is given in Fig. 2. Now, in each time step k, the \(\left. {{\mathbf{w}}_{k} } \right|\theta_{k - 1}\) is determined from the sliding window optimization below, using the current active mode \(\theta_{k - 1}\). For reasons soon explained, only \(\left. {{\mathbf{w}}_{k} } \right|\theta_{k - 1}\) is thus calculated:

Fig. 2

Principle of finding the predictor modes for unlabelled data over the training data set time period T. For every time point \(t_k \in T\), the optimal \(\mathbf{w}_k\) is determined by Eq. (10), where the optimal prediction \({\mathbf{w}}_{k} {\hat{\mathbf{y}}}\) (light green dash-dotted curve), formed from the individual predictions \({\hat{\mathbf{y}}}\) (the blue dashed and the red solid curves), is evaluated against the reference (black solid curve) using the cost function \(\fancyscript{L}\) over a sliding data window between t = k − N/2 and t = k + N/2. The aggregated set \(\{\mathbf{w}_k\}_T\) is thereafter subjected to clustering to find the different mode centers \({\mathbf{w}}_{\left. 0 \right|\theta = i} ,\;i = \left[ {1, \ldots ,p} \right]\)

$$\begin{aligned}{\mathbf{w}}_{k|\theta_{k-1}} =&\, \arg\min \;\sum\limits_{j = k - N}^{k - 1} \mu^{k-j} \fancyscript{L}\left( y_j,\; {\mathbf{w}}_{k|\theta_{k-1}}^T \hat{\mathbf{y}}_j \right) \\ & +\, \left( {\mathbf{w}}_{k|\theta_{k-1}} - {\mathbf{w}}_{0|\theta_{k-1}} \right) \varLambda_{\theta_{k-1}} \left( {\mathbf{w}}_{k|\theta_{k-1}} - {\mathbf{w}}_{0|\theta_{k-1}} \right)^T \\ &\, {\text{s}}.{\text{t}}.\; \sum\limits_j {\mathbf{w}}_{k|\theta_{k-1}}^{\left( j \right)} = 1 \\ \end{aligned}$$
(11)

Here, μ is a forgetting factor, and \(\varLambda_{\theta_k = i}\) is a regularization matrix. To infer the posterior \({\mathbf{p}}\left( {\left. {\theta_{k} } \right|D_{k} } \right)\) in (9), it would normally be natural to set this probability function equal to the corresponding posterior pdf of the dynamic mode, \({\mathbf{p}}\left( {\left. S \right|D_{k} } \right)\). However, problems arise if \({\mathbf{p}}\left( {\left. S \right|D_{k} } \right)\) cannot be directly estimated from the dataset \(D_k\). This is circumvented by using the information provided by the \({\mathbf{p}}\left( {{\mathbf{w}}_{k|\theta_{k}}} \right)\) estimated from the data retrieved from Eq. (10) above. The \({\mathbf{p}}\left( {{\mathbf{w}}_{k|\theta_{k}}} \right)\) prior density functions can be seen as defining the region of validity of each predictor mode. If the \({\mathbf{w}}_{k|\theta_{k - 1}}\) estimate leaves the currently active mode region \(\theta_{k - 1}\) (in the sense that \({\mathbf{p}}\left( {{\mathbf{w}}_{k|\theta_{k - 1}}} \right)\) becomes very low), it can thus be seen as an indication that a mode switch has taken place. A logical test is used to determine whether a mode switch has occurred. The predictor mode is switched to mode \(\theta_{k} = i\) if:

$$p\left( {\left. {\theta_{k} = i} \right|{\mathbf{w}}_{k} ,\;D_{k} } \right) > \lambda$$
(12)

where

$$p\left( {\left. {{\theta_k} = i} \right|{{\bf{w}}_k},\;{D_k}} \right) = \frac{{p\left( {\left. {{{\bf{w}}_k}} \right|{\theta_k} = i,\;{D_k}} \right)p\left( {\left. {{\theta_k} = i} \right|{D_k}} \right)}}{{\sum\nolimits_j {p\left( {\left. {{{\bf{w}}_k}} \right|{\theta_k} = j,\;{D_k}} \right)} p\left( {\left. {{\theta_k} = j} \right|{D_k}} \right)}}$$
(13)

A λ somewhat larger than 0.5 gives a hysteresis effect to avoid chattering between modes. Unless otherwise estimated from data, the conditional probability of each prediction mode \(p\left( {\left. {\theta_{k} = i} \right|D_{k} } \right)\) is set equal for all possible modes, and thus cancels in (13). The logical test is evaluated using the priors received from the pdf estimate and the \(\mathbf{w}_{k|\theta_{k-1}}\) received from (11). If a mode switch is considered to have occurred, (11) is rerun using the new predictor mode.
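As an illustration of the switching test in Eqs. (12)–(13), the sketch below assumes that the mode-conditional weight densities \(p(\mathbf{w}|\theta = i)\) have been approximated by Gaussians with the fitted mode centres and covariances; the function names, the Gaussian approximation and the default λ = 0.6 are illustrative choices rather than prescriptions from the original work.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mode_posterior(w_k, mode_means, mode_covs, mode_priors):
    """Posterior p(theta_k = i | w_k, D_k) according to Eq. (13), with the
    mode-conditional densities p(w | theta = i) approximated as Gaussians."""
    likes = np.array([
        # allow_singular covers covariances made rank-deficient by sum(w) = 1
        multivariate_normal.pdf(w_k, mean=m, cov=c, allow_singular=True)
        for m, c in zip(mode_means, mode_covs)
    ])
    post = likes * np.asarray(mode_priors, dtype=float)
    return post / post.sum()

def switch_mode(current_mode, w_k, mode_means, mode_covs, mode_priors, lam=0.6):
    """Switching test of Eq. (12): change mode only if another mode's posterior
    exceeds lambda; lambda > 0.5 gives hysteresis against chattering."""
    post = mode_posterior(w_k, mode_means, mode_covs, mode_priors)
    best = int(np.argmax(post))
    if best != current_mode and post[best] > lam:
        return best
    return current_mode
```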

Now, since only one prediction mode \(\theta_k\) is active, (9) reduces to \({\mathbf{p}}\left( {{\mathbf{w}}_{k} } \right) = {\mathbf{p}}\left( {{\mathbf{w}}_{k|\theta_{k}}} \right)\). The predictor mode switching concept is visualised in Fig. 3.

Fig. 3

Predictor mode switching for an example with three individual predictor models. Step I: at time instance \(t_k\), the new \(w_{k|\theta_{k - 1}}\) is determined from Eq. (11). In this case, the data force the optimal weight away from the active prediction mode center. Step II: the likelihood values \(p\left( {\left. {w_{k} } \right|\theta_{k} = i} \right),\;i = \left[ {1, \ldots ,p} \right]\) are calculated, and if the condition in Eq. (12) is fulfilled, a predictor mode switch occurs. Step III: based on the new predictor mode, Eq. (11) is rerun and the weight vector now gravitates towards the new mode center

3.1 Parameter Choice

The length N of the evaluation period is, together with the forgetting factor μ, a crucial parameter determining how fast the ensemble prediction reacts to sudden changes in dynamics. A small forgetting factor puts much emphasis on recent data, making the predictor more agile to sudden changes. The drawback is, of course, that the noise sensitivity increases.

\(\varLambda_{\theta_k = i}\) should also be chosen such that a sound balance between flexibility and robustness is found, i.e., too small a \({\|\varLambda_{\theta_k = i}\|}_2\) may result in over-switching, whereas too large a \({\|\varLambda_{\theta_k = i}\|}_2\) will give a stiff and inflexible predictor. Furthermore, \(\varLambda_{\theta_k = i}\) should force the weights to move within the perimeter defined by \(p(\mathbf{w}|\theta_k = i)\). This is approximately accomplished by setting \(\varLambda_{\theta_k = i}\) equal to the inverse of the covariance matrix \({\mathbf{R}}_{\theta_k = i}\), thus representing the pdf as a Gaussian distribution in the regularization.

Optimal values for N and μ can be found by evaluating different choices for some test data. However, from our experience we have seen that N = 10–20 and μ = 0.8 are suitable choices for this application.
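A sketch of the sliding-window weight update in Eq. (11) is given below, using a plain quadratic error as a stand-in for the application-specific cost \(\fancyscript{L}\) and taking Λ as the inverse mode covariance, as suggested above; the solver choice and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def update_weights(y_win, Y_hat_win, w0_mode, Lambda_mode, mu=0.8):
    """Sliding-window weight update in the spirit of Eq. (11).

    y_win      : (N,) recent references y_{k-N}, ..., y_{k-1} (oldest first)
    Y_hat_win  : (N, m) matching individual predictions
    w0_mode    : (m,) centre of the currently active predictor mode
    Lambda_mode: (m, m) regularization, e.g. the inverse mode covariance
    mu         : forgetting factor in (0, 1]
    """
    N, m = Y_hat_win.shape
    discounts = mu ** np.arange(N, 0, -1)        # mu^(k-j), largest lag first

    def cost(w):
        err = y_win - Y_hat_win @ w              # squared error stands in for L
        dw = w - w0_mode
        return np.sum(discounts * err ** 2) + dw @ Lambda_mode @ dw

    cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
    res = minimize(cost, x0=w0_mode, constraints=cons, method='SLSQP')
    return res.x
```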

3.2 Nominal Mode

Apart from the estimated predictor mode centres, an additional predictor mode can be added, corresponding to a heuristic fall-back mode. In the case of sensor failure, or other situations where confidence in the estimated predictor modes is lost, each predictor may seem equally valid. In this case, a fall-back mode to resort to may be equal weighting. This is also a natural starting point for the algorithm. For these reasons, a nominal mode \(\theta_k = 0: p(\mathbf{w}_k|\theta_k = 0) \in N(\mathbf{1}/m, \mathbf{I})\) is added to the set of predictor modes.

Summary of algorithm

  1. Estimate m predictors according to best practice.

  2. Run the predictors and the constrained estimation (10) on labelled training data and retrieve the sequence of \(\left\{ {{\mathbf{w}}_{\left. k \right|\varTheta = i} } \right\}_{{T_{{S_{i} }} }} ,\;\forall i \in \left\{ {1, \ldots ,n} \right\}\).

  3. Classify the different predictor modes, and determine density functions \({\mathbf{p}}\left( {{\mathbf{w}}_{\left. k \right|\varTheta = i} } \right)\) for each mode Θ = i from the training results by supervised learning. If possible, estimate p(Θ = i|D).

  4. Initialize the mode to the nominal mode.

  5. For each time step, calculate \(\mathbf{w}_k\) according to (11).

  6. Test whether switching should take place by evaluating (12) and (13); switch predictor mode if necessary and recalculate \(\mathbf{w}_k\) according to (11).

  7. Go to 5.

The ensemble engine outlined above will hereafter be referred to as Sliding Window Bayesian Model Averaging (SW-BMA) Predictor.
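Schematically, steps 4–7 of the algorithm might be wired together as below; `update_weights` and `switch_mode` refer to the earlier sketches, the data-access conventions are placeholders, and the alignment of q-step-ahead predictions with their target times is glossed over.

```python
import numpy as np

def sw_bma_predict(predictors, stream, modes, lam=0.6, mu=0.8, N=20):
    """Schematic SW-BMA loop, steps 4-7 of the algorithm summary.

    predictors : list of callables, each returning a q-step-ahead prediction
    stream     : iterable of (y_k, u_k) samples
    modes      : dict with per-mode 'means', 'covs' and 'priors';
                 mode 0 is the nominal, equal-weight mode
    Relies on update_weights() and switch_mode() from the earlier sketches.
    """
    m = len(predictors)
    mode = 0                                     # step 4: start in nominal mode
    w = np.full(m, 1.0 / m)
    hist_y, hist_yhat, merged = [], [], []

    for y_k, u_k in stream:
        y_hat = np.array([predict(y_k, u_k) for predict in predictors])
        hist_y.append(y_k)
        hist_yhat.append(y_hat)

        if len(hist_y) > N:
            y_win = np.array(hist_y[-N:])
            Y_win = np.array(hist_yhat[-N:])
            Lam = np.linalg.pinv(modes['covs'][mode])
            # step 5: weights conditioned on the currently active mode
            w = update_weights(y_win, Y_win, modes['means'][mode], Lam, mu)
            # step 6: test for a mode switch and, if one occurs, recompute
            new_mode = switch_mode(mode, w, modes['means'], modes['covs'],
                                   modes['priors'], lam)
            if new_mode != mode:
                mode = new_mode
                Lam = np.linalg.pinv(modes['covs'][mode])
                w = update_weights(y_win, Y_win, modes['means'][mode], Lam, mu)

        merged.append(float(w @ y_hat))          # step 7 / Eq. (1)
    return merged
```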

4 Choice of Cost Function \(\fancyscript{L}\)

The cost function should be chosen with the specific application in mind. A natural choice for interpolation is the 2-norm, but in certain situations asymmetric cost functions are more appropriate. For the glucose prediction application, a suitable candidate for determining appropriate weights should take into account that the consequences of acting on too high glucose predictions in the lower blood glucose (G) region (<90 mg/dl) could possibly be life-threatening. The margins to low blood glucose levels, which may result in coma and death, are small, and blood glucose levels may fall rapidly. Hence, much emphasis should be put on securing small positive prediction errors and sufficient time margins for alarms to be raised in due time in this region. In the normoglycemic region (here defined as 90–200 mg/dl), the predictive quality is of less importance. This is the glucose range that healthy subjects normally experience, and it can thus be considered, from a clinical viewpoint with regard to possible complications, a safe region. However, due to the possibility of rapid fluctuation of the glucose into unsafe regions, some consideration of predictive quality should be maintained.

Based on the cost function in [58], the selected function incorporates these features: an asymmetrically increasing cost of the prediction error, depending on the absolute glucose value and the sign of the prediction error.
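The exact cost function of [58] is not reproduced here; the following is a purely illustrative stand-in that captures the two qualitative features above, i.e., a cost that grows asymmetrically with the sign of the prediction error and more steeply at low glucose levels.

```python
import numpy as np

def asymmetric_cost(y_ref, y_pred):
    """Illustrative asymmetric cost (not the function from [58]).

    Overestimation (y_pred > y_ref) is penalized extra when the reference
    glucose is low (< 90 mg/dl), since acting on a too-high prediction there
    is potentially dangerous; elsewhere a plain quadratic cost is used.
    """
    err = y_pred - y_ref                       # positive = overestimation
    base = err ** 2
    low_region = y_ref < 90.0
    overestimate = err > 0.0
    # scale factor grows as the reference drops further below 90 mg/dl
    penalty = 1.0 + np.where(low_region & overestimate,
                             (90.0 - y_ref) / 10.0, 0.0)
    return penalty * base

# Example: a +20 mg/dl error is costed much harder at 70 mg/dl than at 150 mg/dl
print(asymmetric_cost(70.0, 90.0), asymmetric_cost(150.0, 170.0))   # 1200.0 400.0
```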

In Fig. 4 the cost function can be seen, plotted against relative prediction error and absolute blood glucose value.

Fig. 4

Cost function of relative prediction error

4.1 Correspondence to the Clarke Grid Error Plot

A de facto standardized metric for measuring the performance of CGM signals in relation to reference measurements, often also used to evaluate glucose predictors, is the Clarke Grid Plot [19]. This metric meets the minimum criteria raised earlier. However, other aspects make it less suitable: no distinction is made between prediction errors within error zones, switches in evaluation score at zone boundaries are instantaneous, etc.

In Fig. 5, the isometric contours of the chosen function for different prediction errors at different G values have been plotted together with the Clarke Grid Plot. The boundaries of the A/B/C/D/E areas of the Clarke Grid can be regarded as lines of isometric cost according to the Clarke metric. In the figure, the isometric value of the cost function has been chosen to correspond to the lower edge, defined by the intersection of the A and B Clarke areas at 70 mg/dl. Thus, the area enveloped by the isometric border can be regarded as the corresponding A area of this cost function.

Fig. 5

Isometric cost in comparison to the Clarke Grid

Evidently, much tougher demands are imposed in both the lower and upper glucose regions in comparison to the Clarke Plot.

5 Example I: The UVa/Padova Simulation Model

5.1 Data

Data were generated using the nonlinear metabolic simulation model, jointly developed by the University of Padova, Italy and University of Virginia, U.S. (UVa) and described in [24], with parameter values obtained from the authors. The model consists of three parts that can be separated from each other. Two sub models are related to the influx of insulin following an insulin injection and the rate of appearance of glucose from the gastro-intestinal tract following meal intake, respectively.

The transport of rapid-acting insulin from the subcutaneous injection site to the blood stream was based on the compartment model in [23, 24], as follows.

$$\dot{I}_{sc1} \left( t \right) = - \left( {k_{a1} + k_{d} } \right) \cdot I_{sc1} \left( t \right) + D\left( t \right)$$
(14)
$$\dot{I}_{sc2} \left( t \right) = k_{d} \cdot I_{sc1} \left( t \right) - k_{a2} \cdot I_{sc2} \left( t \right)$$
(15)
$$\dot{I}_{p} \left( t \right) = k_{a1} \cdot I_{sc1} \left( t \right) + k_{a2} \cdot I_{sc2} \left( t \right) - \left( {m_{2} + m_{4} } \right) \cdot I_{p} \left( t \right) + m_{1} \cdot I_{l} \left( t \right)$$
(16)
$$\dot{I}_{l} \left( t \right) = m_{2} \cdot I_{p} \left( t \right) - \left( {m_{1} + m_{3} } \right) \cdot I_{l} \left( t \right)$$
(17)
$$m_{2} = 0.6\frac{{C_{L} }}{{H_{Eb} \cdot V_{i} \cdot M_{BW} }}$$
(18)
$$m_{3} = \frac{{H_{Eb} \cdot m_{1} }}{{1 - H_{Eb} }}$$
(19)
$$m_{4} = 0.4\frac{{C_{L} }}{{V_{i} \cdot M_{BW} }}$$
(20)

Following the notation in [23, 24], \(I_{sc1}\) is the amount of non-monomeric insulin in the subcutaneous space, \(I_{sc2}\) is the amount of monomeric insulin in the subcutaneous space, \(k_d\) is the rate constant of insulin dissociation, \(k_{a1}\) and \(k_{a2}\) are the rate constants of non-monomeric and monomeric insulin absorption, respectively, \(D(t)\) is the insulin infusion rate, \(I_p\) is the level of plasma insulin, \(I_l\) the level of insulin in the liver, \(m_3\) is the rate of hepatic clearance, and \(m_1\), \(m_2\), \(m_4\) are rate parameters. The parameters \(m_2\), \(m_3\), \(m_4\) are determined based on steady-state assumptions, relating them to the constants in Table 1 and the body weight \(M_{BW}\).

Table 1 Generic parameter values used for the GSM and ISM
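To make the insulin submodel concrete, the sketch below integrates Eqs. (14)–(17) for a single bolus. The rate constants are placeholder values chosen for illustration only, since the generic parameter values of Table 1 are not reproduced here.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Placeholder rate constants (1/min); the Table 1 values are not reproduced here.
k_a1, k_a2, k_d = 0.004, 0.012, 0.016
m_1, m_2, m_3, m_4 = 0.19, 0.48, 0.29, 0.32

def insulin_sc_model(t, x, D):
    """Subcutaneous insulin kinetics, Eqs. (14)-(17).
    State x = [I_sc1, I_sc2, I_p, I_l]; D(t) is the insulin infusion rate."""
    I_sc1, I_sc2, I_p, I_l = x
    dI_sc1 = -(k_a1 + k_d) * I_sc1 + D(t)
    dI_sc2 = k_d * I_sc1 - k_a2 * I_sc2
    dI_p = k_a1 * I_sc1 + k_a2 * I_sc2 - (m_2 + m_4) * I_p + m_1 * I_l
    dI_l = m_2 * I_p - (m_1 + m_3) * I_l
    return [dI_sc1, dI_sc2, dI_p, dI_l]

# A bolus delivered at a constant rate during the first 5 min, then zero infusion
bolus = lambda t: 6.0 if t < 5.0 else 0.0
sol = solve_ivp(insulin_sc_model, (0, 300), [0.0, 0.0, 0.0, 0.0], args=(bolus,),
                t_eval=np.arange(0, 301, 5), max_step=1.0)
I_p_profile = sol.y[2]                           # plasma insulin over time
```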

The initial stages of glucose metabolism, describing the digestive process and the flux of glucose from the intestines, have been modeled as follows:

$$q_{sto} \left( t \right) = q_{sto1} \left( t \right) + q_{sto2} \left( t \right)$$
(21)
$$\dot{q}_{sto1} \left( t \right) = - k_{gri} \cdot q_{sto1} \left( t \right) + C\left( t \right)$$
(22)
$$\dot{q}_{sto2} \left( t \right) = k_{gri} \cdot q_{sto1} \left( t \right) - k_{empt} \left( q_{sto} \right) \cdot q_{sto2} \left( t \right)$$
(23)
$$\dot{q}_{gut} \left( t \right) = - k_{abs} \cdot q_{gut} \left( t \right) + k_{empt} \left( q_{sto} \right) \cdot q_{sto2} \left( t \right)$$
(24)
$$R_{a} \left( t \right) = \frac{{f \cdot k_{abs} \cdot q_{gut} \left( t \right)}}{{M_{BW} }}$$
(25)

where, again following the notation in [21], \(q_{sto}\) is the amount of glucose in the stomach (\(q_{sto1}\) solid and \(q_{sto2}\) liquid phase), \(q_{gut}\) is the glucose mass in the intestine, \(k_{gri}\) the rate of grinding, \(k_{empt}\) is the rate constant of gastric emptying, \(k_{abs}\) is the rate constant of intestinal absorption, f is the fraction of intestinal absorption which actually appears in the blood stream, \(C(t)\) is the amount of ingested carbohydrates and \(R_a(t)\) is the appearance rate of glucose in the blood. \(k_{empt}\) is a nonlinear function of \(q_{sto}\) and \(C(t)\):

$${k_{empt}}\left( {{q_{sto}}} \right) = {k_{min }} + k \cdot \left\{ {\tanh \left[ {\alpha \left( {{q_{sto}} - b \cdot G\left( t \right)} \right)} \right] + } \right.$$
(26)
$$\left. {\quad \quad \quad - \tanh \left[ {\beta \left( {q_{sto} - d \cdot G\left( t \right)} \right)} \right] + 2} \right\}$$
(27)

with \(k = \left( k_{max} - k_{min} \right)/2\), \(\alpha = 5/\left( 2D\left( 1 - b \right) \right)\) and \(\beta = 5/\left( 2Dd \right)\), where D denotes the total amount of ingested carbohydrates, and with parameters \(k_{max}\), \(k_{min}\), b, and d.

Both models were evaluated using generic population parameter values according to Table 1.
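A corresponding sketch of the oral glucose subsystem, Eqs. (21)–(27), is given below. Parameter values are again placeholders, and the gastric-emptying thresholds are scaled by the total carbohydrate content of the meal (the quantity denoted D in (26)–(27)), which is an interpretation rather than a literal transcription of the printed equations.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Placeholder parameters; the Table 1 values are not reproduced here.
k_max, k_min, k_abs, k_gri = 0.056, 0.008, 0.057, 0.056   # 1/min
b, d, f = 0.82, 0.01, 0.90
M_BW = 70.0                                                # body weight, kg

def k_empt(q_sto, D_meal):
    """Nonlinear gastric emptying rate, cf. Eqs. (26)-(27); D_meal is here
    interpreted as the total carbohydrate content of the meal (mg)."""
    k = (k_max - k_min) / 2.0
    alpha = 5.0 / (2.0 * D_meal * (1.0 - b))
    beta = 5.0 / (2.0 * D_meal * d)
    return k_min + k * (np.tanh(alpha * (q_sto - b * D_meal))
                        - np.tanh(beta * (q_sto - d * D_meal)) + 2.0)

def meal_model(t, x, C, D_meal):
    """Oral glucose subsystem, Eqs. (21)-(24). State x = [q_sto1, q_sto2, q_gut]."""
    q_sto1, q_sto2, q_gut = x
    ke = k_empt(q_sto1 + q_sto2, D_meal)
    dq_sto1 = -k_gri * q_sto1 + C(t)
    dq_sto2 = k_gri * q_sto1 - ke * q_sto2
    dq_gut = -k_abs * q_gut + ke * q_sto2
    return [dq_sto1, dq_sto2, dq_gut]

D_meal = 70000.0                                   # 70 g of carbohydrate, in mg
meal = lambda t: D_meal / 5.0 if t < 5.0 else 0.0  # ingested over the first 5 min
sol = solve_ivp(meal_model, (0, 360), [0.0, 0.0, 0.0], args=(meal, D_meal),
                t_eval=np.arange(0, 361, 5), max_step=1.0)
R_a = f * k_abs * sol.y[2] / M_BW                  # Eq. (25), mg/kg/min
```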

The final part of the total model is concerned with the interaction of glucose and insulin in the blood stream, organs and tissue, including renal extraction, endogenous glucose production and insulin and non-insulin dependent glucose utilization. The model equations are partly nonlinear and are found in [24].

Using a parameter set corresponding to a subject with type 1 diabetes (retrieved from the authors of [24]), 20 datasets, each 8 days long, were generated. The timing and size of meals were randomized for each dataset, according to Table 2. The amount of insulin administered for each meal was based on a fixed carbohydrate-to-insulin ratio, perturbed by normally distributed noise, with a 20 % standard deviation.

Table 2 Meal amount and timing randomization

Process noise was added by perturbing some crucial model parameters \(p_i\) in each simulation step: \(p_i(t) = (1 + \delta(t)) p_{i}^{0}\), where \(p_{i}^{0}\) represents the nominal value and δ(t) ∈ N(0, 0.2). The affected parameters were (again following the notation in [24]) \(k_1\), \(k_2\), \(p_{2u}\), \(k_i\), \(m_1\), \(m_{30}\), \(m_2\), \(k_{sc}\), and the perturbations represent natural intrapersonal variability in the underlying physiological processes.

Two dynamic modes A and B were simulated by, after 4 days, changing four model parameters (following the notation in [24]), \(k_1\), \(k_i\), \(k_{p3}\) and \(p_{2u}\), related to endogenous glucose production and insulin and glucose utilization. This represents an example of a shift in the underlying patient dynamics, which may occur due to, e.g., sudden changes in physical or mental stress levels.

A section of 4 days, including the period when the dynamic change took place, of a data set can be seen in Fig. 6. One of the 20 datasets was used for training and the others were considered test data.

Fig. 6

The training data set. The upper plot represents 4 days of dynamic mode A data and the lower plot the corresponding last 4 days of dynamic mode B, where four model parameters have been modified. Example I: UVa/Padova Model

5.2 Predictors

For prediction modeling purposes, the system was considered to consist of three main parts, in a similar sense to how the simulation model was constructed. The absorption models of glucose and insulin were adopted and considered known. The outputs \(I_p(t_k)\) and \(R_a(t_k)\) from these models were fed into a linear state-space model of the glucose-insulin interaction (GIIM), generating the final output, the blood glucose G(k) at time \(t_k \in (5, 10, \ldots)\) min. Short-term predictions, p steps ahead, were evaluated using the Kalman filter:

$$\hat{x}\left( {k + 1} \right) = A\hat{x}\left( k \right) + Bu\left( k \right) + K\left( {y\left( k \right) - C\hat{x}\left( k \right)} \right)$$
(28)
$$\hat{x}\left( {k + p} \right) = A\hat{x}\left( {k + p - 1} \right) + Bu\left( {k + p - 1} \right)$$
(29)
$$\hat{G}\left( {k + p} \right) = C\hat{x}\left( {k + p} \right)$$
(30)

where meal and insulin announcements were assumed to be available at least \(T_{PH}\) minutes ahead, implying that \(u(k + l)\) was known for all 0 < l < p.
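A sketch of the p-step-ahead predictor in Eqs. (28)–(30) is given below, assuming a single-output discrete-time model (A, B, C) and Kalman gain K are available, e.g., from subspace identification; the variable names are illustrative.

```python
import numpy as np

def predict_p_steps(A, B, C, K, x_hat, y_k, u_future, p):
    """p-step-ahead prediction per Eqs. (28)-(30) for a single-output model.

    A (n, n), B (n, n_u), C (n,), K (n,) : model matrices and Kalman gain
    x_hat    : (n,) current state estimate x_hat(k)
    y_k      : current glucose measurement
    u_future : (p, n_u) announced inputs u(k), ..., u(k+p-1)
    Returns the propagated state x_hat(k+p) and the prediction G_hat(k+p).
    """
    # Eq. (28): measurement update and one-step propagation
    x = A @ x_hat + B @ u_future[0] + K * (y_k - C @ x_hat)
    # Eq. (29): open-loop propagation over the rest of the horizon
    for i in range(1, p):
        x = A @ x + B @ u_future[i]
    return x, float(C @ x)                       # Eq. (30)
```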

Three models were identified using the N4SID algorithm of the Matlab System Identification Toolbox. The model order (2–4) was determined by the Akaike criterion [53]. The first model (I) was estimated using data from dynamic mode A in the training data, the second (II) from the mode B data, and the final model (III) from the entire training data set. Thus, models I and II are each specialized, whereas III is an average over the two dynamic modes. The models were evaluated for a prediction horizon of 60 min.

5.3 Results

5.3.1 Training the Mode Switcher

The three predictors were used to create three sets of 60-min-ahead predictions for the training data. Using (10) with N = 10, the weights \(\mathbf{w}_k\) were determined. The mode centers were found by k-means clustering, and the corresponding probability distribution for each mode, projected onto the \((w_1, w_2)\)-plane, was thereafter estimated by the Parzen window technique [11]. The densities are well concentrated around the corners [1, 0, 0] and [0, 0, 1], with means \({\mathbf{w}}_{0|1} = \left[ 0.96, 0.03, 0.01 \right]\) and \({\mathbf{w}}_{0|2} = \left[ 0.03, 0.96, 0.01 \right]\) defining the expected weights for each predictor mode. The nominal mode probability density function was set to \(N\left( \left[ \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3} \right],\; 0.1\mathbf{I} \right)\). In Fig. 7, all density functions, including the nominal mode, projected onto the \((w_1, w_2)\)-plane, can be seen together.
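A sketch of this training step is given below, using scikit-learn's k-means for the mode centres and a Gaussian kernel density estimate as the Parzen window approximation; the library choices and the projection onto the first two weight coordinates mirror the figures but are otherwise illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans

def fit_predictor_modes(W_train, n_modes=2):
    """Cluster a training weight sequence {w_k} into predictor modes and
    estimate a centre, covariance and Parzen-type density per mode."""
    km = KMeans(n_clusters=n_modes, n_init=10, random_state=0).fit(W_train)
    modes = []
    for i in range(n_modes):
        W_i = W_train[km.labels_ == i]
        modes.append({
            'center': W_i.mean(axis=0),              # w_{0|theta=i}
            # R_{theta=i}; may need a pseudo-inverse later, since the
            # sum-to-one constraint makes the full covariance rank-deficient
            'cov': np.cov(W_i, rowvar=False),
            # Parzen window (kernel) estimate of p(w | theta = i), taken on
            # the first two weight coordinates as in the projected figures
            'pdf': gaussian_kde(W_i[:, :2].T),
        })
    return modes
```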

Fig. 7

Estimated probability density functions for the weights in the training data, including nominal mode. Example I: UVa/Padova model

5.3.2 Ensemble Prediction Versus Individual Predictions

Using the estimated probability density functions and the expected weights w of the identified predictor modes, the ensemble machine was run on the test data. An example of the distribution of the weights for the two dynamic modes A and B can be seen in Fig. 8.

Fig. 8

Example of the distribution of weights in the test data using the estimated pdfs and expected weights. Example I: UVa/Padova model

An example of how switching between the different modes occurs over the test period can be found in Fig. 9.

Fig. 9

Example of switching between different predictor modes in the test data. The transition from dynamic mode B to mode A takes place at 6000 min (ca. 4 days). Mode 3 represents the nominal mode. The late switch to predictor mode 2, compared to when the dynamic mode switch takes place, is due to the low excitation during the first hours of the fifth day until the breakfast meal takes place, i.e., there is little incentive to switch predictor mode before that point. Example I: UVa/Padova model

For evaluation purposes, all predictors were run individually. In Table 3, a comparative summary of the predictive performance of the different approaches over the test batches, in terms of mean Root Mean Square Error (RMSE), is given. It was also noted that the merged prediction did not introduce any extra prediction delay in comparison to the best individual prediction (not shown).

Table 3 Performance evaluation by RMSE for the 60 min predictors using different approaches

6 Example II: The DIAdvisor Data

6.1 Data

Data from the clinical part of the DAQ trial and the DIAdvisor I B and C trials, conducted within the DIAdvisor project [30], were used. A number of patients participated in all three trials. Based on data completeness, six of these were selected for this study, with population characteristics according to Table 4. All selected data were collected at the Montpellier Hospital, and each trial ran over three days. The patients received standardized meals, where the amount of carbohydrates included in each meal was about 40 (45 in DAQ), 70 and 70 g, respectively. Additional snacks, in some cases taken to counteract hypoglycemia, were also ingested. No specific intervention in the usual diabetes treatment was undertaken during the studies, since a truthful picture of normal blood glucose fluctuation and insulin-glucose interaction was sought. Meal and insulin administration were noted in a logbook, glucose was monitored by the Abbott Freestyle [1] (DAQ) and the Dexcom Seven Plus [29] (DIAdvisor I) CGM systems, and frequent blood glucose measurements (>37 samples a day) were collected for calibration and as reference measurements. The CGM data were used for model identification, whereas the spline-interpolated frequent blood glucose reference measurements were used for validation purposes.

Table 4 Population Statistics of data

The first trial data (DAQ) were used to train the individual predictor models. The second and third trial data (DIAdvisor I.B and C) were used to train and cross-validate the SW-BMA, i.e., the SW-BMA was trained on B data and validated on C data, and vice versa.

6.2 Predictors

Three different predictors of different structure were developed within the DIAdvisor project and used in this study: a state-space-based model (SS) [98], a recursive ARX model [36] and a kernel-based predictor [70]. For all three models, the CGM signal \(G_{CGM}(t)\) was considered a proxy for the blood glucose G(t), i.e., the lag between the interstitial glucose and the blood glucose, described in, e.g., [87], was ignored.

The state-space model and the ARX model used the modeling approach depicted in Fig. 10, with insulin and glucose sub models according to Eqs. (14)–(27), and without interstitial and sensor dynamics modeling (\(M_2\)). The state-space model described the glucose-insulin interaction, and the glucose prediction, according to Eqs. (28)–(30). The ARX predictor was recursively updated at each time step with an adaptive update gain dependent upon the glucose level according to [36].

Fig. 10

Overview of the modeling approach. Notation: plasma insulin \(I_p(t)\), rate of glucose appearance following a meal \(R_a(t)\), blood glucose \(G(t)\), capillary glucose \(G_C(t)\), interstitial glucose \(G_I(t)\), CGM raw current signal \(G_{Iraw}(t)\) and CGM signal \(G_{cgm}(t)\). \(M_1\) represents the model describing the glucose-insulin interaction in the blood and inner organs (GIIM), \(M_2\) represents the diffusion-like relationship between blood and interstitial glucose and the CGM sensor dynamics, and \(M_3\) is the joint model of \(M_1\) and \(M_2\)

The kernel-based predictor did not directly utilize the insulin or meal data channels. Instead, the linear trend and offset parameters given by linear regression of recent CGM data were used as meta features to switch between different predefined kernel-based prediction functions; see [71] for a full explanation. Furthermore, this predictor was trained on a single patient data set only and was thus considered patient-invariant.
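As a rough sketch of the meta-feature idea only (the actual kernel-based prediction functions of [71] are not reproduced, and the thresholds and function names below are hypothetical), the trend and offset can be obtained by a linear fit to the most recent CGM window:

```python
import numpy as np

def select_prediction_function(cgm_window, ts=5.0):
    """Extract slope/offset meta features from recent CGM samples and
    pick one of several predefined prediction functions.

    cgm_window : recent CGM values (mg/dL), oldest first
    ts         : sampling interval in minutes (assumed)
    """
    t = np.arange(len(cgm_window)) * ts            # time axis in minutes
    slope, offset = np.polyfit(t, cgm_window, 1)   # linear regression
    if slope > 1.0:            # rapidly rising glucose (hypothetical threshold)
        return "rising_kernel", slope, offset
    elif slope < -1.0:         # rapidly falling glucose
        return "falling_kernel", slope, offset
    return "steady_kernel", slope, offset
```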

6.3 Evaluation Criteria

The prediction results were compared to the interpolated blood glucose G in terms of Clarke Grid Analysis [19] and the complementary Root Mean Square Error (RMSE).
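A minimal sketch of these criteria, assuming glucose values in mg/dL; only the well-known A-zone rule of the Clarke grid (prediction within 20 % of the reference, or both values below 70 mg/dL) is shown, while the remaining B–E zones are not reproduced here.

```python
import numpy as np

def rmse(y_ref, y_pred):
    """Root mean square error against the interpolated blood glucose."""
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_ref) ** 2)))

def clarke_a_zone_fraction(y_ref, y_pred):
    """Fraction of points in the Clarke A zone: prediction within 20 %
    of the reference, or both reference and prediction below 70 mg/dL."""
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    in_a = (np.abs(y_pred - y_ref) <= 0.2 * y_ref) | ((y_ref < 70) & (y_pred < 70))
    return float(np.mean(in_a))
```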

6.4 Results

6.5 Training the Mode Switcher

6.5.1 Cluster Analysis: Finding the Modes

The three predictors were used to create 40 min ahead predictions for both training data sets \(D_{T_{B(C)}}\). Using (10) with N = 20, the weights \(\{\mathbf{w}_k\}_{T_{B(C)}}\) were obtained; an example is depicted in the \((w_1, w_2)\) plane in Fig. 11. The weights received from the training are easily visually recognized as belonging to different groups (true for all patients, not shown). Attempts were made to find clusters using a Gaussian Mixture Model (GMM) fitted by the EM algorithm, but without viable outcome. This is not totally surprising, considering, e.g., the constraints \(0 \le w_i \le 1\) and \(\sum_i w_i = 1\). A more suitable distribution, often used as a prior for the weights in a GMM, is the Dirichlet distribution, but instead the simpler k-means algorithm was applied using four clusters (the number of clusters given by visual inspection of the distribution of \(\{\mathbf{w}_k\}_{T_{B(C)}}\)), providing the cluster centers \(\mathbf{w}_{0|\varTheta_i}\).
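As a minimal sketch, assuming the weight vectors obtained from (10) have been stacked row-wise into an array W of shape (number of windows, number of predictors), the cluster centers can be obtained with a standard k-means implementation (here scikit-learn, with four clusters as in the text):

```python
import numpy as np
from sklearn.cluster import KMeans

def find_predictor_modes(W, n_modes=4, seed=0):
    """Cluster the weight vectors from the sliding-window fit (10).

    W : array of shape (n_windows, n_predictors); each row satisfies
        0 <= w_i <= 1 and sum(w_i) = 1.
    Returns the cluster centers (the mode weights w_{0|i}) and the
    cluster label of each weight vector."""
    km = KMeans(n_clusters=n_modes, n_init=10, random_state=seed).fit(W)
    return km.cluster_centers_, km.labels_
```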

Fig. 11

Example of distribution of weights in the training data by (10) and clusters given by the k-means algorithm. The red ellipses represent the fitted Gaussian covariances of each cluster (patient 0103, Trial B). Example II: DIAdvisor Data

The corresponding probability distribution for each mode \(p(\mathbf{w} \mid \varTheta_i)\), projected onto the \((w_1, w_2)\) plane, was estimated by the Parzen window technique; an example can be seen in Fig. 12. Gaussian distributions were fitted to give the covariance matrices \(\mathbf{R}_{\varTheta_i}\) used in (11).
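A minimal sketch of this step, reusing W and the k-means labels from the previous sketch: scipy's gaussian_kde serves as the Parzen-window estimator (Gaussian kernels) on the \((w_1, w_2)\) projection, and the per-cluster sample covariance stands in for the fitted Gaussian covariance \(\mathbf{R}_{\varTheta_i}\); function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mode_densities_and_covariances(W, labels, n_modes=4):
    """Per-mode Parzen-window density (in the (w1, w2) plane, as in
    Fig. 12) and fitted Gaussian covariance.

    W      : weight vectors, shape (n_windows, n_predictors)
    labels : cluster index of each weight vector (from k-means)
    Returns a list of (kde, covariance) pairs, one per mode."""
    out = []
    for i in range(n_modes):
        Wi = W[labels == i]
        kde = gaussian_kde(Wi[:, :2].T)   # Parzen window, Gaussian kernels
        R = np.cov(Wi, rowvar=False)      # covariance used in Eq. (11)
        out.append((kde, R))
    return out
```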

Fig. 12

Example of estimated probability density functions for the different predictor mode clusters in the training data (patient 0103, Trial B). Example II: DIAdvisor Data

6.5.2 Feature Selection

The posterior mode probability \(p(\theta_k \mid D_k)\) is likely not dependent on the entire data \(D_k\), but only on a few relevant features that can be extracted from \(D_k\). Features related to the performance of a glucose predictor may include meal information, insulin administration, level of activity, measures of the glucose dynamics, etc. By plotting the training CGM data, colored according to the best mode at the prediction horizon retrieved in training, interesting correlations become apparent (Fig. 13). The binary features in Table 5 were selected.

Fig. 13

Example of CGM coloured according to best predictor mode 40 min ahead, together with active features at the moment the prediction was made (patient 0103, Trial B). Example II: DIAdvisor Data

Table 5 Selected features

When extracting the features, meal timing and content were considered to be known 30 min before the meal.

From the training data, the posterior mode probabilities \(p(\theta_k = i \mid f_j)\), given each feature \(f_j\), were determined by the ratio of active time for each mode over the time periods when each feature was present. Additionally, the overall prior \(p(\theta_k = i)\) was determined by the total ratio of active time per cluster over the entire training period.
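A minimal sketch of these time-ratio estimates, assuming best_mode holds the index of the best mode at each training sample and feature_active flags which features are present at each sample (both hypothetical names):

```python
import numpy as np

def mode_probabilities(best_mode, feature_active, n_modes):
    """Estimate p(theta = i | f_j) as the fraction of time mode i is
    best while feature f_j is active, plus the overall prior p(theta = i).

    best_mode      : int array, shape (n_samples,), best mode per sample
    feature_active : bool array, shape (n_samples, n_features)
    """
    n_samples, n_features = feature_active.shape
    prior = np.array([np.mean(best_mode == i) for i in range(n_modes)])
    p_mode_given_f = np.zeros((n_features, n_modes))
    for j in range(n_features):
        active = feature_active[:, j]
        if active.any():
            p_mode_given_f[j] = [np.mean(best_mode[active] == i)
                                 for i in range(n_modes)]
        else:
            p_mode_given_f[j] = prior   # no support: fall back to the prior
    return p_mode_given_f, prior
```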

The different features overlap, and their combinations could be regarded as features in themselves. However, the data support for each such new feature would be small and could potentially disrupt, rather than improve, the switching performance. To resolve this issue, the features were not combined (apart from concurrent rising glucose and meal intake, which formed a new feature), and each feature was instead given a priority, allowing only the feature of highest priority, \(f_k^*\), to be present at each time step \(t_k\). The priority rank was chosen to let the more specific features take precedence over the more general ones. At each cycle, \(p(\theta_k = i \mid D_k) = p(\theta_k = i \mid f_k^*)\) was determined, and if no feature was active, \(p(\theta_k = i \mid D_k)\) was approximated by the estimate of \(p(\theta_k = i)\).
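The run-time lookup can then be sketched as follows, continuing the previous sketch and assuming the feature indices are supplied in the chosen priority order (highest priority first); names are hypothetical.

```python
def mode_posterior(active_features, p_mode_given_f, prior, priority):
    """Return p(theta_k = i | D_k) using only the highest-priority
    active feature f_k^*; fall back to the prior if no feature is active.

    active_features : bool array over features at time t_k
    p_mode_given_f  : array (n_features, n_modes) from the training data
    prior           : array (n_modes,), overall prior p(theta = i)
    priority        : feature indices, highest priority first
    """
    for j in priority:
        if active_features[j]:
            return p_mode_given_f[j]
    return prior
```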

6.6 Prediction Performance on Test Data

Using the estimated mode clusters \(\{\mathbf{w}_{0|i}, \mathbf{R}_{0|i}\}\), \(i = 1, \ldots, M\), and the estimated posteriors \(p(\varTheta_i \mid f^*)\) from Trial B (C), the ensemble machine was run on the Trial C (B) data. The parameter μ was set to 0.8 and N to 20 min. An example of the distribution of the weights \(\mathbf{w}_k\) for the three predictors can be seen in Fig. 14.
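Equations (11)–(13) are not reproduced in this section, so the following is only a schematic sketch of the soft-switching idea under stated assumptions: the weight vector fitted over the most recent window is scored against each mode's Gaussian (center \(\mathbf{w}_{0|i}\), covariance \(\mathbf{R}_{\varTheta_i}\)), combined with the feature-conditioned mode posterior, and the resulting mode probabilities blend the cluster centers, with μ assumed to act as a smoothing factor. All names are hypothetical, and this is not the exact SW-BMA update.

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_switch_weights(w_recent, w_prev, centers, covs,
                        p_mode_given_feature, mu=0.8):
    """Schematic soft switcher (not the exact Eqs. (11)-(13)).

    w_recent : weight vector fitted over the last N minutes, as in (10)
    w_prev   : merged weights from the previous cycle
    centers  : (M, n_pred) cluster centers w_{0|i}
    covs     : list of M covariance matrices R_{Theta_i}
    p_mode_given_feature : p(theta = i | f*) from the training correlations
    """
    M = centers.shape[0]
    like = np.array([multivariate_normal.pdf(w_recent, mean=centers[i],
                                             cov=covs[i], allow_singular=True)
                     for i in range(M)])
    post = p_mode_given_feature * like      # combine feature prior and fit
    post = post / (post.sum() + 1e-12)      # normalize mode probabilities
    w_new = post @ centers                  # probability-weighted mode centers
    return mu * w_prev + (1.0 - mu) * w_new # gradual, 'soft' transition
```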

Fig. 14

Example of the distribution of weights in the test data using the estimated clusters and feature correlations (patient 0108, Trial B). Example II: DIAdvisor Data

Table 6 summarizes a comparison of predictive performance over the different patient test data sets in terms of the RMSE criterion, and Table 7 gives the evaluation in terms of Clarke Grid Analysis. The optimal switching approach, here defined as using the non-causal fitting by Eq. (10), serves as a measure of the best achievable performance of a linear combination of the different predictors.

Table 6 Performance evaluation for the 40 min SW-BMA prediction compared to the optimal switching and the individual predictors
Table 7 Performance evaluation for the 40 min SW-BMA prediction compared to the optimal switching and the best individual predictor by the amount of data (%) in the acceptable A/B zones versus the dangerous D and E zones

7 Discussion

Example I outlined how the technique may be applied to the specific example of diabetic glucose prediction under sudden changes in the underlying physiological dynamics. In this example, the merged prediction turned out to be the best choice. In Example II, applying the algorithm to real-world data, the SW-BMA achieved, for most patients, the same RMSE and Clarke Grid performance as the best individual predictor. In one case, the merged prediction also clearly outperformed the best individual predictor (\(\mathrm{RMSE}/\mathrm{RMSE}_{\mathrm{best}} = 0.75\)). However, comparison to the optimal switcher indicates that there is still room for improvement, and to close this gap, timely switching is most important.

The prediction models in Example II were not specifically designed for specialisation, but they are diversified with respect to each other in terms of modeling and parameter identification methods. The state-space model is patient-specific, with fixed parameter values after training, making it well adapted to interpersonal differences but more sensitive to time variability; the model is also invariant to the absolute glucose level. The ARX model, on the other hand, is recursively updated to capture time variability, but the approach may be vulnerable to fluctuating system excitation conditions. Both models utilize the insulin and meal data inputs. The kernel-based predictor is generic over the patient cohort and considers the dynamics to be related to the glucose level rather than directly to the effects of the inputs. Overall, the three models thereby complement each other in these aspects. The posterior mode probabilities, conditioned on each selected feature, show that some specialisation exists. For example, when feature 5 (meal onset) was active, cluster 3, dominated by the SS predictor, was clearly favoured on average (61 %). Exploiting these correlations may enhance timely switching, and further specialisation and diversification amongst the prediction models can thus be expected to further improve the added value of prediction merging.

The evaluation indicates that the proposed algorithm is robust to sudden changes and effective in reducing the impact of modeling errors. Moreover, in many applications the transition between different dynamic modes is a gradual process rather than an abrupt switch, making a pure switching assumption inappropriate. The proposed algorithm can handle such smooth transitions by slowly sliding along a trajectory in the weight plane of the different predictors, perhaps with a weaker Λ if such properties are expected. Furthermore, any type of predictor may be used, without restricting the user to a priori assumptions about the underlying process structure.

In Takagi–Sugeno (TS) systems, a technique that also provides soft switching, the underlying assumption is that the switching dynamics can be observed directly from the data. This assumption has been relaxed in the proposed algorithm, extending its applicability beyond that of TS systems.

In [86], another interesting approach to online Bayesian Model Averaging under changing dynamics is suggested. In that approach, the assumed transition dynamics between the different modes are based on a Markov chain. In our approach, however, no such assumptions on the underlying switching dynamics are postulated. Instead, switching is based on recent performance with respect to the applicable norm, and possibly on estimated correlations between predictor modes and features of the data stream, \(p(\theta_k = i \mid D_k)\); see Eq. (13).

8 Conclusions

A novel merging mechanism for multiple glucose predictors has been proposed for time-varying and uncertain conditions. The approach was evaluated on both artificial and real-world data sets, incorporating modeling errors in the individual predictors and time-shifting dynamics.

The results show that the merged prediction has a predictive performance comparable to the best individual predictor in each case, and they indicate that the concept may prove useful when dealing with several individual (glucose) predictors of uncertain reliability, reducing the risk associated with definite a priori model selection, or serving as a means to improve predictive quality if the predictions are diverse enough.

Further research will be undertaken to investigate how informative features, correlated with expected predictor mode changes, should be extracted, and to explore the possibility of making the algorithm unsupervised.