1 Introduction

Diabetes mellitus (DM), also known simply as diabetes, is a disease mainly characterized by high blood glucose levels [1]. It is classified into four groups: type 1, type 2, gestational and other specific types of diabetes [2]. The disease is mainly related to insulin, one of the hormones that regulate the blood glucose concentration in the body.

Diabetes is a major healthcare problem worldwide. According to the latest available global report by the World Health Organization [3], around 422 million adults had diabetes in 2014. The disease's prevalence and the number of cases continue to increase annually, with the number of people with diabetes expected to reach 693 million by 2045 [4]. In 2012, diabetes was found to be the direct cause of 1.5 million deaths, in addition to 2.2 million deaths caused indirectly by complications of the disease [3]. Heart attacks, strokes and kidney failure are some of the complications that may arise from diabetes [5]. These complications are very costly for healthcare systems and could be drastically reduced with better control of the disease [6,7,8].

Type 2 diabetes is the most common form of the disease. However, type 1 diabetes, the second most common, requires more attention during daily activities. The treatment to manage type 1 diabetes is based on the exogenous administration of insulin, supported by exhaustive blood glucose monitoring, with which most of the complications can be prevented or minimized [9]. In addition to insulin therapy, nutritional therapy and physical exercise are essential for controlling blood glucose levels [5]. Carbohydrate intake and physical exercise are the main disturbances to the blood glucose concentration.

The goal of blood glucose control is to reach normoglycemia (a normal concentration of blood glucose), minimizing the occurrences of hypoglycemia and hyperglycemia (low and high blood glucose levels, respectively). The American Diabetes Association (ADA) defines hypoglycemia as a measurable glucose concentration below 70 mg/dl and hyperglycemia as a glucose concentration above 180 mg/dl. The 2019 ADA revision recommended the range 80–130 mg/dl as the premeal glucose target [10].

Self-monitoring carried out by the patient is necessary to track and control blood glucose levels [11]. Patients need to measure their capillary glucose levels discretely, several times per day, with a glucose meter, or to use a continuous glucose monitoring (CGM) sensor [12, 13]. CGM sensors provide a continuous flow of information about the glucose concentration, allowing better control of the disease in real time.

A glucose predictor would allow patients to plan their actions and, therefore, deal with the disease better [14]. However, for a predictor to be useful, two factors must be taken into account [15]. First, the prediction horizon (PH) has to be long enough (at least 15 min) for patients to have time to adjust their treatment. Second, CGM systems measure glucose levels in the interstitial fluid of the subcutaneous adipose tissue, not directly in blood, so there is a delay between the glucose concentration measured by CGM sensors and the blood glucose level [16]; it is therefore important that the PH be longer than this delay.

Artificial neural networks (ANNs) have been widely used to address forecasting problems in different fields. ANNs excel in comparison with traditional algorithms for time series forecasting because they are able to infer the complex, nonlinear relationships in the data [17,18,19,20].

ANNs have been used in the past with good performance in the prediction of blood glucose levels. Specifically, different techniques have been applied to this task, such as multilayer perceptrons (MLPs) [15], convolutional neural networks (CNNs) [21] and recurrent neural networks (RNNs) [22].

Pérez-Gandía et al. [15] proposed an MLP to predict glucose in real time from two different sensors. The model used only CGM measurements as input data and provided three different PHs: 15, 30 and 45 min. Moreover, they compared their model with the auto-regressive model (ARM) of [23], obtaining better results with the MLP. Zhu et al. [21] proposed a model built from CNN layers that employed WaveNet algorithms [24]. They used the type 1 diabetes OhioT1DM dataset developed by [25], with the patients' glucose levels, insulin boluses, carbohydrate intake and time as inputs. They also tried additional input data, such as heart rate and temperature; however, these new parameters hindered the performance of their model. An RNN, proposed by [22], used input data from a CGM sensor. The authors compared their RNN model with an MLP model at four different PHs: 15, 30, 45 and 60 min. As expected, the results showed that the RNN architecture made better predictions than the MLP model, because its feedback connections allow it to learn temporal dependencies [26].

Given the importance of short- and long-term prediction for diabetes patients, this paper presents a model based on long short-term memory (LSTM) neural networks that extends the work of [15]. LSTMs are a type of RNN able to maintain both short- and long-term dependencies through their states [27]. They are composed of gates that control the information stored in the states by modulating the input and output.

Therefore, the main goal of this paper is to design and implement a set of predictors that accurately forecast the glucose level of patients with type 1 diabetes and improve previous results in the literature. First, a main predictor model is developed, consisting of an LSTM fed with previous values of glucose, insulin bolus and meal intake. Based on this main model, a set of predictors specialized in forecasting at different PHs is deployed. Furthermore, the effect of different input dimensions (IDs) on the performance of the predictors is explored.

The paper is structured as follows. Section 2 introduces the basic concepts of LSTM neural networks. Section 3 presents the dataset and the process followed to develop the set of predictors evaluated in this paper. Section 4 is dedicated to assessing the set of predictors designed in Sect. 3. Section 5 provides the conclusions of the paper and outlines possibilities for future research.

2 Theoretical background

LSTM neural networks were first introduced by [27]. The LSTM is a type of RNN developed specifically to solve the vanishing and exploding gradient problems [28, 29]. During the training of RNNs with backpropagation through time (BPTT) [30], the gradients propagated over time tend to vanish or become unstable, which makes it very difficult for RNNs to learn long-term dependencies. LSTMs address this issue and are able to capture both long- and short-term dependencies.

An LSTM cell (see Fig. 1) is composed of four layers, also called gates [26, 31]. The four gates are the forget gate (\(f_{k}\)), the input gate (\(i_{k}\)), the new cell state candidate gate (\(\widetilde{c_{k}}\)) and the output gate (\(o_k\)). Each gate has its own parameters: a bias (b), a weight matrix for the input value (W) and a weight matrix for the hidden state value (U).

Fig. 1

Computational graph of an LSTM cell [32]

The cell state is the memory of the cell and maintains information through time. The inputs to the four gates are the corresponding input vector \(x_{k}\), where k indexes the sequence, \(k \in \left\{ 1, \ldots , \tau \right\} \), \(\tau \in \mathbb {N}\) being the number of time steps, and the previous output (hidden state) of the LSTM (\(h_{k-1}\)).

The behavior of an LSTM cell is as follows:

  1. 1.

    The forget gate decides what fraction of the information in the cell state should be discarded and what fraction should remain. The activation function of this gate is a sigmoid (\(\sigma \)):

    $$\begin{aligned} f_{k}=\sigma \left( W_{f}x_{k}+ U_{f}h_{k-1}+b_{f}\right) \end{aligned}$$
    (1)

    where \(W_{f}\) is the input weight matrix, \(x_{k}\) is the input vector, \(U_{f}\) is the previous output weight matrix, \(h_{k-1}\) is the previous output and \(b_{f}\) is the bias.

    The output of the forget gate (\(f_{k}\)) is multiplied element-wise by the previous cell state (\(c_{k-1}\)).

    $$\begin{aligned} c_{f_{k}}= f_{k} \circ c_{k-1} \end{aligned}$$
    (2)

    This element-wise product is what erases information from the cell state; analogous products appear in the input and output gates.

  2. 2.

    The next step consists of adding new information to the cell state, \(c_{i_{k}}\). The new cell state candidate gate (\(\widetilde{c_{k}}\)) proposes the new information that might be added to the memory of the cell, using a hyperbolic tangent (\(\tanh \)) as the activation function and its own parameters (\(W_{c}\), \(U_{c}\), \(b_{c}\)):

    $$\begin{aligned} \widetilde{c_{k}} = \tanh {\left( W_{c}x_{k}+U_{c}h_{k-1}+b_{c}\right) } \end{aligned}$$
    (3)

    In contrast, the input gate uses a sigmoid function:

    $$\begin{aligned} i_{k}=\sigma \left( W_{i}x_{k}+ U_{i}h_{k-1}+b_{i}\right) \end{aligned}$$
    (4)

    Next, the new information to be added to the cell state is calculated:

    $$\begin{aligned} c_{i_{k}}= i_{k} \circ \widetilde{c_{k}} \end{aligned}$$
    (5)
  3. 3.

    The new cell state is updated with the information provided by the two previous steps:

    $$\begin{aligned} c_{k}=c_{f_{k}}+c_{i_{k}} \end{aligned}$$
    (6)
  4. 4.

    Next, the output gate calculates its output by means of a sigmoid activation function:

    $$\begin{aligned} o_{k}=\sigma \left( W_{o}x_{k}+ U_{o}h_{k-1}+b_{o}\right) \end{aligned}$$
    (7)

    The cell state (\(c_{k}\)) is passed through a hyperbolic tangent (\(\tanh \)) activation function. Finally, the output \(h_{k}\) is calculated as:

    $$\begin{aligned} h_{k}=o_{k} \circ \tanh {\left( c_{k}\right) } \end{aligned}$$
    (8)
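To make the gate equations concrete, the following NumPy sketch implements one time step of the cell described by Eqs. (1)-(8). It is an illustrative implementation, not the code used in this paper; the parameter names (Wf, Uf, bf, etc.) are assumptions that mirror the notation above.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell_step(x_k, h_prev, c_prev, p):
    # p holds, for each gate g in {f, i, c, o}: input weights p["Wg"],
    # hidden-state weights p["Ug"] and bias p["bg"], following Sect. 2.
    f_k = sigmoid(p["Wf"] @ x_k + p["Uf"] @ h_prev + p["bf"])      # Eq. (1)
    i_k = sigmoid(p["Wi"] @ x_k + p["Ui"] @ h_prev + p["bi"])      # Eq. (4)
    c_tilde = np.tanh(p["Wc"] @ x_k + p["Uc"] @ h_prev + p["bc"])  # Eq. (3)
    c_k = f_k * c_prev + i_k * c_tilde                             # Eqs. (2), (5), (6)
    o_k = sigmoid(p["Wo"] @ x_k + p["Uo"] @ h_prev + p["bo"])      # Eq. (7)
    h_k = o_k * np.tanh(c_k)                                       # Eq. (8)
    return h_k, c_k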

3 Design process

3.1 Dataset

The dataset includes time-dependent information about the glucose levels obtained with the Guardian® Real Time CGM sensor, the insulin data administered by each patient’s insulin pump, the food intake converted into grams of carbohydrates and the capillary blood glucose levels from a glucometer (used for the calibration of the CGM sensor). The glucose levels are sampled every 5 min, while the insulin data and the food intake are adjusted to this sampling period. Therefore, all the glucose profiles have a sampling period of 5 min, meaning each profile has 288 samples per day.

The first three variables (glucose, insulin and carbohydrate intake) from the aforementioned database will be used for prediction (see Fig. 2 for an example). The fourth variable, the capillary blood glucose level, was automatically used by the Guardian® Real Time sensor to calibrate its output.

Fig. 2

Representation of the raw parameters to feed the RNN. The top graph represents the glucose values obtained by the CGM sensor, the middle graph represents the insulin units provided, and the bottom graph shows the food intake in grams of carbohydrates

3.1.1 Data preprocessing

3.1.2 Glucose preprocessing

CGM profiles usually present three main artifacts that need to be fixed before the dataset can be used for glucose prediction:

  1. 1.

    There are missing values (readings equal to zero), mainly caused by problems with the connection between the CGM sensor and the receiver or by problems during calibration. Linear interpolation has been used to fill in the missing data.

  2. 2.

    There are abrupt, non-differentiable steps that appear when a blood glucose calibration value is entered into the CGM (see the step in Fig. 3). A threshold of 30 mg/dl has been defined to detect these distortions: only steps greater than or equal to the threshold are compensated. The compensation works as follows: if a distortion greater than or equal to the threshold is found, the algorithm checks whether an associated calibration value produced it. If this is confirmed, a linear transformation is applied to the signal to minimize the distortion, see Eq. 9 (a code sketch follows this list). The transformation is applied before and after the calibration point (CP), the sample at which the calibration is inserted into the CGM sensor measurement (\(k_{a}=0\)). This process is applied until a number of time steps (\(N_{p}\), \(N_{a}\)) is reached or until the distortion decreases below a threshold of 20 mg/dl.

    $$\begin{aligned} {\left\{ \begin{array}{ll} y^{'}_{\mathrm{CGM}}[i_{p}]= y_{\mathrm{CGM}}[i_{p}] + \mathfrak {C} \left( \frac{N_{p}-k_{p}}{N_{p}}\right) \\ y^{'}_{\mathrm{CGM}}[i_{a}]= y_{\mathrm{CGM}}[i_{a}] + \mathfrak {C} \left( \frac{N_{a}-k_{a}}{N_{a}}\right) \end{array}\right. } \end{aligned}$$
    (9)

    where

    $$\begin{aligned} {\left\{ \begin{array}{ll} i_{p}=\left( \hbox {CP}+k_{p}\right) \\ i_{a}=\left( \hbox {CP}-k_{a}\right) \\ k_{p}=\{1,\ldots ,N_{p}\}\\ k_{a}=\{0,\ldots ,N_{a}\} \end{array}\right. } \end{aligned}$$
    (10)

    where \(N_{p}\) is the number of time steps over which the transformation is applied after the CP (samples \(i_{p}\)), whereas \(N_{a}\) corresponds to the number of time steps at and before the CP (samples \(i_{a}\)). The corrective measure (\(\mathfrak {C}\)) is applied to the CGM signal from the CP, and its effect decreases the further it moves from that point. \(\mathfrak {C}\) is directly proportional to the difference between the glucometer calibration measure \(\left( y\left[ \hbox {CP}\right] _{\mathrm{calibration}}\right) \) and the CGM value \(\left( y\left[ \hbox {CP}\right] _{\mathrm{CGM}}\right) \), see Eq. 11.

    $$\begin{aligned} \mathfrak {C}=\theta \left( y\left[ \hbox {CP}\right] _{\mathrm{calibration}}-y\left[ \hbox {CP}\right] _{\mathrm{CGM}}\right) \end{aligned}$$
    (11)

    where \(\theta \) is a constant corresponding to the fraction of the corrective measure applied to the signal; its value is chosen between 0 and 1. Therefore, \(\theta \), \(N_{p}\) and \(N_{a}\) are selected for each profile to minimize the distortion of the original signal while maintaining its trend and shape. Figure 3 shows an example of the result of the process compared with the pre-calibrated CGM signal and the calibration measures. During this process, the 57th profile, out of 58, was found to have no calibration data; this profile has therefore been dropped from the study because its distortions could not be corrected.

  3. 3.

    Finally, there is noise in the CGM signal. To reduce the CGM signal noise and smooth the signal, a moving average filter is applied:

    $$\begin{aligned} \hat{x}[k]=\frac{1}{N}\sum _{n=0}^{N-1} x[k-n] \end{aligned}$$
    (12)

    where N is the length of the window, x is the raw signal and \(\hat{x}\) is the filtered signal. The moving average is a finite impulse response (FIR) filter: its output depends only on a finite window of input samples. For this application, the filter should keep the relevant information in the filtered signal; however, filtering introduces a delay in the output and a potential loss of information. Balancing these effects, the chosen moving average filter has a length of \(N = 5\).
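As referenced above, the following sketch implements the calibration compensation of Eqs. (9)-(11) and the moving average filter of Eq. (12). It is a simplified interpretation under stated assumptions: the detection of distortions and the per-profile selection of \(\theta \), \(N_{p}\) and \(N_{a}\) are omitted, and the default values shown are illustrative, not those used in the paper.

import numpy as np

def compensate_calibration_step(y_cgm, cp, y_cal, theta=0.5, n_p=12, n_a=12):
    # Corrective measure, Eq. (11): proportional to the difference between
    # the glucometer calibration value and the CGM reading at the CP.
    y = np.array(y_cgm, dtype=float)
    c = theta * (y_cal - y[cp])
    # Eq. (9), first branch: samples i_p = CP + k_p, k_p in {1, ..., N_p}.
    for k_p in range(1, n_p + 1):
        if cp + k_p < len(y):
            y[cp + k_p] += c * (n_p - k_p) / n_p
    # Eq. (9), second branch: samples i_a = CP - k_a, k_a in {0, ..., N_a}.
    for k_a in range(0, n_a + 1):
        if cp - k_a >= 0:
            y[cp - k_a] += c * (n_a - k_a) / n_a
    return y

def moving_average(x, n=5):
    # Causal moving average FIR filter of Eq. (12) with window length N = 5.
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for k in range(len(x)):
        out[k] = x[max(0, k - n + 1):k + 1].mean()
    return out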

Fig. 3

Step produced by the calibration. This graph represents the glucose values from the pre-calibrated CGM sensor, the calibration measures made by the glucometer and the CGM signal after the transformation

3.1.3 Insulin preprocessing

Insulin pumps try to emulate the functioning of a healthy pancreas with two kinds of insulin injections: the basal insulin bolus and the preprandial insulin bolus. The basal profile is intended to emulate the insulin secretion of the pancreas during fasting periods, and the preprandial insulin bolus is intended to control hyperglycemia after each meal. The preprandial bolus is several orders of magnitude larger than the basal micro-boluses, which are usually infused every 3–5 min. The preprandial insulin boluses are represented as an impulse function (see Fig. 2). This may cause the ANN to ignore the value of these insulin boluses or to behave in other unforeseen ways. Therefore, a preprocessing step is needed.

The chosen solution is to transform the preprandial insulin bolus, which is an impulse function, into a rectangular function with the same area. The preprandial bolus is distributed over 12 samples, corresponding to 1 h, simulating the delivery of an insulin bolus over an extended period of time (known as the extended bolus in insulin infusion systems), which avoids a high initial dose of insulin and allows an extended insulin action [33].

3.1.4 Food intake preprocessing

The food intake was manually recorded by the patient. These values were transformed into grams of carbohydrates, representing the amount of carbohydrates eaten by the patient in that meal. Each registered meal is represented by a single value, so the food intake is also an impulse function (see Fig. 2).

The food intake parameter is likewise transformed into a rectangular function with the same area as the impulse function, spread over 12 samples equivalent to 1 h. The value of 1 h has been chosen as the average absorption time of glucose across the different types of carbohydrates.
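Both the insulin boluses and the meal records undergo the same impulse-to-rectangle transformation. A minimal sketch, assuming the signals are sampled every 5 min and each impulse occupies a single sample:

import numpy as np

def spread_impulses(signal, width=12):
    # Replace each impulse with a rectangular pulse of `width` samples
    # (12 samples = 1 h at a 5-min sampling period) with the same area.
    out = np.zeros(len(signal), dtype=float)
    for k in np.flatnonzero(signal):
        end = min(k + width, len(signal))
        out[k:end] += signal[k] / width
    return out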

3.1.5 Normalization

LSTMs have sigmoid and \(\tanh \) activation functions in their gates (see Sect. 2). The output of the sigmoid lies in the range between 0 and 1, while the range of the \(\tanh \) function is between \(-\,1\) and 1. Therefore, all input parameters to the ANN have been normalized to the range 0–1 to prevent unforeseen results and to facilitate the proper learning of the neural network. The same process is applied to the targets of the ANN model. Min–max feature scaling is used to normalize the input parameters and targets. Figure 4 shows the result of all the preprocessing techniques applied to the dataset.
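For reference, min–max feature scaling is the standard mapping of each parameter x to the interval [0, 1]:

$$\begin{aligned} x_{\mathrm{norm}}=\frac{x-\min (x)}{\max (x)-\min (x)} \end{aligned}$$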

Fig. 4

Parameters after the application of the preprocessing techniques. The top graph represents the glucose values obtained by the CGM sensor. The middle graph represents the insulin units provided, while the bottom graph shows the food intake in grams of carbohydrates

3.1.6 Data arrangement

The data are arranged in input vectors, \(\mathfrak {I}= \{\mathfrak {I}_1, \mathfrak {I}_2,\ldots , \mathfrak {I}_n\}\), and target vectors, \(\mathfrak {T}= \left\{ \mathfrak {T}_1, \mathfrak {T}_2, \ldots , \mathfrak {T}_n\right\} \), where n is the number of input–target tuples in the dataset. The input time series are: glucose (\(\varvec{x}=\left\{ \varvec{x}_{1}, \varvec{x}_{2}, \ldots , \varvec{x}_{n}\right\} \)), insulin (\(\varvec{w}=\left\{ \varvec{w}_{1}, \varvec{w}_{2}, \ldots , \varvec{w}_{n}\right\} \)) and food intake (\(\varvec{z}=\left\{ \varvec{z}_{1}, \varvec{z}_{2}, \ldots , \varvec{z}_{n}\right\} \)). These values are converted into a supervised learning problem using the sliding window (windowing) method [34, 35], which has been applied as follows (a code sketch follows the list):

  • The ID (input dimension), i.e., the length of the sequence of past values, determines how many past values of glucose, insulin and carbohydrate intake are fed as input to the predictor. The PH specifies how many minutes into the future the prediction is made.

  • The past values of glucose are used as inputs to the predictor, while the future values of glucose are its targets. Therefore, the glucose time series has to be re-arranged into input and target values. The insulin and carbohydrate intake parameters are only inputs to the system, so they only have to be re-arranged as input values.

  • \(\mathfrak {I}_i\) represents an input time series to the predictor at a specific time instant, and \(\mathfrak {T}_i\) is the corresponding future glucose value for that input. For specific values of ID and PH, every input (\(\mathfrak {I}_i\)) is arranged as a sequence reaching \(\tau =\hbox {ID}-1\) samples into the past, that is:

    $$\begin{aligned} \mathfrak {I}_{i}= \left[ \begin{array}{cccc} x[k-\tau ] & \dots & x[k-1] & x[k]\\ w[k-\tau ] & \dots & w[k-1] & w[k]\\ z[k-\tau ] & \dots & z[k-1] & z[k]\\ \end{array}\right] \end{aligned}$$
    (13)

    The corresponding target (\(\mathfrak {T}_i\)) is the glucose value at the expected PH in the future, that is:

    $$\begin{aligned} \mathfrak {T}_{i}= x[k+\hbox {PH}] = y[k] \end{aligned}$$
    (14)
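As referenced above, the windowing of Eqs. (13) and (14) can be implemented as in the following sketch; names and shapes are illustrative assumptions.

import numpy as np

def make_windows(x, w, z, input_dim, ph_steps):
    # x, w, z: glucose, insulin and carbohydrate series of equal length.
    # input_dim: ID, number of past samples per input sequence.
    # ph_steps: PH expressed in samples (PH in minutes / 5 min per sample).
    inputs, targets = [], []
    for k in range(input_dim - 1, len(x) - ph_steps):
        window = np.stack([x[k - input_dim + 1:k + 1],
                           w[k - input_dim + 1:k + 1],
                           z[k - input_dim + 1:k + 1]])  # Eq. (13), shape (3, ID)
        inputs.append(window)
        targets.append(x[k + ph_steps])                  # Eq. (14)
    return np.array(inputs), np.array(targets)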

The dataset is divided into three subsets: training, validation and test. It is important that all the data belonging to a given patient fall into only one of the subsets, because a patient's data are correlated among themselves, and mixing them across subsets could bias the evaluation and yield misleading results. Taking this into consideration, the subsets are distributed as follows: the training dataset has 36 profiles from five patients, the validation dataset has 6 profiles from one patient and the test dataset has 15 profiles from two patients.

3.2 Neural network architecture

The framework chosen to implement the predictor is PyTorch [36], an open source platform for developing neural network models that is deeply integrated with Python.

An example predictor, shown in Fig. 5, is composed of three layers: two LSTM layers and one final linear layer. The LSTM layers (see Sect. 2) have a variable number of neurons to be optimized, and the linear layer is a single perceptron with one neuron and a linear activation function. Figure 5 also presents the unfolding and the data flow of the model. First, the input sequence (\(\mathfrak {I}_i\)) enters the first LSTM layer, where the unfolding is processed; its output (\(h_{k}^1\)) is the input to the following LSTM layer. Note that the superscript denotes the layer of the model. The second LSTM layer processes the information and provides its output for the last value of the sequence, \(h_{\tau }^{2}\), to the linear layer. Finally, the linear layer computes from \(h_{\tau }^{2}\) the expected glucose concentration (\(y_{\mathrm{PH}}\)) at the desired PH (a PyTorch sketch is given after Fig. 5).

Fig. 5

Generic architecture of the proposed predictor
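As referenced above, a minimal PyTorch sketch of the architecture in Fig. 5 follows. The hyperparameter values shown (hidden size, number of layers) are placeholders to be replaced by the optimized values of Sect. 3.3.

import torch
import torch.nn as nn

class GlucosePredictor(nn.Module):
    def __init__(self, n_features=3, hidden_size=50, num_layers=2):
        super().__init__()
        # Stacked LSTM layers followed by a single linear neuron.
        self.lstm = nn.LSTM(n_features, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, seq):             # seq: (batch, ID, 3)
        out, _ = self.lstm(seq)         # out: (batch, ID, hidden_size)
        return self.linear(out[:, -1])  # prediction from the last step, h_tau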

3.3 Parameter optimization

All IDs and PHs are combined to search for the parameters that produce the best performance. In this paper, three values of ID are set for the network optimization: 5, 20 and 50 samples, defining small, medium and large input vectors. The search for the optimal parameters is divided into two parts: the architecture parameters and the parameters related to the training of the proposed predictor.

For the architecture optimization, the output linear layer is not modified and remains a single layer with one neuron. Therefore, in this section, any reference to the number of layers or neurons refers to the LSTM layers of the model.

All LSTM layers have the same number of neurons. The number of neurons considered in this study ranges from 10 to 100 in steps of 10, and the number of LSTM layers is 1, 2 or 3. This gives a total of 30 combinations, which are searched to find the best performance for the 3 IDs (5, 20 and 50 samples) and the four PHs (5, 15, 30 and 45 min).

Moreover, two optimizers are considered to train the proposed predictor: RMSprop [37] and Adam [38]. These optimizers update the weights and biases of the ANN in the backpropagation algorithm. In addition, five learning rates (\(\alpha \)) are considered, \(\alpha =\left[ 0.0001, 0.0005, 0.001, 0.005, 0.01\right] \), to search for the best training parameters.

This parameter setup makes a total of 3600 combinations to be tested (see the sketch below). From them, the 12 models shown in Table 1 are selected as those with the best root mean square error (RMSE) in the prediction. The motivation for using this metric for parameter selection is twofold. Firstly, in contrast to the mean square error (MSE), it provides an error statistic that maintains the scale of the time series under assessment. Secondly, compared to other metrics such as the mean absolute error, it is more sensitive to large errors, which must be mitigated in potential medical applications involving patients. The results of these models are analyzed in detail in Sect. 4.
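The search space can be summarized as the Cartesian product below; this is a hedged sketch of the enumeration, with the training and evaluation loops omitted:

from itertools import product

neurons    = range(10, 101, 10)           # 10 to 100 in steps of 10
layers     = (1, 2, 3)
ids        = (5, 20, 50)                  # input dimensions (samples)
phs        = (5, 15, 30, 45)              # prediction horizons (min)
optimizers = ("RMSprop", "Adam")
lrs        = (0.0001, 0.0005, 0.001, 0.005, 0.01)

grid = list(product(neurons, layers, ids, phs, optimizers, lrs))
assert len(grid) == 3600
# For each combination: train, compute the RMSE and keep the best
# model per (ID, PH) pair, yielding the 12 models of Table 1.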

Table 1 Optimal parameters for each ID and PH

4 Results

4.1 Metrics

For all the metrics, let the prediction be \(Y=\left\{ y_1,\ldots ,y_N\right\} \), the expected outputs or targets \(T=\left\{ t_1,\ldots ,t_N\right\} \), and N the number of samples to be evaluated.

The MSE has been used as the loss function to be minimized, see Eq. 15. It is the function typically used to optimize artificial neural networks with backpropagation: it is differentiable everywhere, simplifies the gradient calculations and is computationally cheap.

$$\begin{aligned} \hbox {MSE}=\frac{1}{N}\sum _{i=1}^{N}\left( t_i-y_i\right) ^2 \end{aligned}$$
(15)

Nonetheless, the following metrics are used in the evaluation of the performance of the proposed predictor:

  • The RMSE, see Eq. 16:

    $$\begin{aligned} \hbox {RMSE} = \sqrt{\frac{1}{N}\sum _{i=1}^N\left( t_i - y_i\right) ^{2}} \end{aligned}$$
    (16)

    where \(t_i\) is the expected output (target) and \(y_i\) is the prediction made by the model. The RMSE, unlike the MSE, expresses the error in the same units as the variables being evaluated. Additionally, it is the metric most frequently used in the literature to quantify the accuracy of glucose prediction models [15, 23, 39, 40] (a code sketch follows this list).

  • The Pearson correlation coefficient (\(\mathrm {r}_{t,y}\)), see Eq. 17:

    $$\begin{aligned} {\mathrm {r}}_{t,y} = \frac{\sum _{i=1}^{N}\left( t_i-\bar{t}\right) \left( y_i-\bar{y}\right) }{\sqrt{\sum _{i=1}^{N}\left( t_i-\bar{t}\right) ^2}\sqrt{\sum _{i=1}^{N}\left( y_i-\bar{y}\right) ^2}} \end{aligned}$$
    (17)

    where \(\bar{t}\) and \(\bar{y}\) are the means, defined as \(\bar{t} =\frac{1}{N} \sum _{i=1}^{N}t_i\) and \(\bar{y} =\frac{1}{N} \sum _{i=1}^{N}y_i\). This coefficient measures the similarity between the target signal and the predicted signal. The Pearson correlation coefficient has been used by other authors to complement the RMSE when evaluating the accuracy of prediction models [41,42,43].

  • The delay associated with the prediction. The purpose of this metric is to assess the lag of the prediction with respect to the original glucose signal from the CGM sensor. Delay reduces the accuracy measured by the RMSE and shortens the effective PH, so it must be quantified to characterize the performance of a glucose predictor. Several authors compute this metric [15, 23, 41, 42, 44,45,46], although the algorithm used to determine the delay is not exactly the same for all of them. To estimate the delay associated with the prediction, the concept of mean delay presented by [23, 44] is extended by computing the delay at every sample between 25% and 75% of the slope length, where the slope length is defined as the difference between the lowest point and the highest point. This metric measures the delay on the positive and negative slopes individually; therefore, it yields two values, one for the rising trends (upward delay) and one for the falling trends (downward delay). The upward and downward delays are finally computed as the averages over all rising trends and all falling trends, respectively. Figure 6 shows an example of the slope points at which the delay is calculated.
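As referenced above, a short NumPy sketch of the first two metrics follows; the delay metric is omitted, since it depends on the slope segmentation just described.

import numpy as np

def rmse(t, y):
    # Root mean square error, Eq. (16).
    t, y = np.asarray(t, float), np.asarray(y, float)
    return np.sqrt(np.mean((t - y) ** 2))

def pearson_r(t, y):
    # Pearson correlation coefficient, Eq. (17).
    t, y = np.asarray(t, float), np.asarray(y, float)
    tc, yc = t - t.mean(), y - y.mean()
    return (tc * yc).sum() / np.sqrt((tc ** 2).sum() * (yc ** 2).sum())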

Fig. 6

Example of the delay calculation; each percentage is calculated over the length of each slope. For the rising slope, \(P_r\) is the lowest point and \(P_{r}^{\prime }\) is the highest point. In the falling slope, \(P_f\) is the highest point and \(P_{f}^{\prime }\) is the lowest point

4.2 Results

In Sect. 3.3, twelve predictors were tuned to predict the glucose concentration using three input parameters. Each of the twelve predictors is specialized for one ID and one PH (see Table 1).

This study is divided into two parts: (1) an evaluation of the accuracy of the 12 predictors at their corresponding PHs; and (2) an assessment of the effect on the prediction when different sequences (IDs) of previous glucose, insulin and carbohydrate intake values are used as input.

In the first part, the twelve predictors are evaluated with the test dataset. To prevent overfitting, early stopping is used: training stops when the performance on the validation dataset begins to worsen [47] (a sketch of this rule is given below). The results of the metrics are shown in Tables 2 and 4 for the training and test datasets, respectively. In addition, the predictions made for one glucose profile of the test dataset are displayed for each model.
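A minimal sketch of the early stopping rule, assuming generic `train_epoch` and `validation_loss` routines (these names are illustrative, not from the paper); in practice the best-performing weights would also be checkpointed and restored:

def train_with_early_stopping(model, train_epoch, validation_loss, patience=10):
    # Stop when the validation loss has not improved for `patience`
    # consecutive epochs [47].
    best, epochs_without_improvement = float("inf"), 0
    while epochs_without_improvement < patience:
        train_epoch(model)
        loss = validation_loss(model)
        if loss < best:
            best, epochs_without_improvement = loss, 0
        else:
            epochs_without_improvement += 1
    return model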

It is important to note that the prediction starts once the first ID values have been fed. An example is presented in Fig. 7 with an ID of 5 and a PH of 45 min: the model starts to predict once it has received the first 5 values. Since the samples are separated by 5 min, a model with an ID of 5 will wait 25 min before beginning the predictions. Once the model has generated the first prediction, it outputs a value for each new input received.

Fig. 7

Example of the effect of an ID of 5 in the prediction

First, it is important to verify that the models do not overfit. Tables 2, 3 and 4 show the results for the training, validation and test datasets, respectively. The RMSE is very similar across the three datasets, being slightly lower for the validation dataset. There are some exceptions where the training RMSE is slightly lower: the models with IDs 20 and 50 and a PH of 45 min. The other metrics also show similar results across the three datasets.

Table 2 Metrics for all the models for the training dataset
Table 3 Metrics for all the models for the validation dataset
Table 4 Metrics for all the models for the test dataset

There are three models specialized in the prediction at each PH. First, we assess the performance of the models that predict with a PH of 5 min. Figure 8 shows the predictions made by the models with IDs 5, 20 and 50 for a specific glucose profile. These three models have a high accuracy, with errors lower than 1 mg/dl (see Table 4), and their time series are almost identical to the original. No delay is observed in the output of these models.

The performance of the models with a PH of 15 min on the same glucose profile is shown in Fig. 9. These models improve the RMSE results presented in [15] for a PH of 15 min by 55.44%, 56.57% and 62.32% for IDs of 5, 20 and 50, respectively. Moreover, Fig. 9 shows that the predicted glucose signals closely resemble the original values from the CGM sensor. This is reinforced by the correlation, \({\mathrm {r}}_{t,y}\), which is very close to 1 (see Table 4). Furthermore, the models present a delay of 5 min on the upward slopes. However, they are faster on the downward slopes, which explains why the models with IDs 20 and 50 have a null downward delay.

The glucose predictions made for a PH of 30 min on the same glucose profile are shown in Fig. 10. Compared with the results obtained in [15] for a PH of 30 min, these predictors improve the RMSE by 27.79%, 28.31% and 30.09% for IDs of 5, 20 and 50, respectively. The \({\mathrm {r}}_{t,y}\) is very close to 1 (see Table 4), showing that the predicted signal closely follows the original one. The delays obtained are similar for all the models that predict with a PH of 30 min.

Lastly, the models with a PH of 45 min on the same profile are shown in Fig. 11. The results of these predictors (see Table 4) improve the RMSE obtained in [15] by 17.54%, 16.98% and 20.53% for IDs of 5, 20 and 50, respectively. The similarity between the predicted signal and the original signal from the CGM sensor has worsened compared with the previous cases, where \({\mathrm {r}}_{t,y}\) was very close to 1. These models are faster on the downward slopes than on the upward slopes.

After studying the performance of all the predictors, the most accurate models are those with an ID of 50 (see Table 4). Moreover, these models are faster on both rising and falling slopes. This result was expected, since they are the models that receive the most information at their inputs. However, the evaluation metrics in Table 4 show similar performance among models with the same PH, independently of the ID.

Figures 8, 9, 10 and 11 also show a high similarity between the predictions at the same PH. Some differences in model performance start to appear when the PH is 30 or 45 min; specifically, the differences are most visible on steep slopes. In these cases, the predictors with IDs of 5 and 20 tend to overshoot slightly more than the predictors with an ID of 50. Nevertheless, these differences are hard to notice and not very remarkable.

Fig. 8

Predictions made at 5 min by the models with an ID of 5, 20 and 50

Fig. 9

Predictions made at 15 min by the models with an ID of 5, 20 and 50

Fig. 10

Predictions made at 30 min by the models with an ID of 5, 20 and 50

Fig. 11

Predictions made at 45 min by the models with an ID of 5, 20 and 50

4.3 Discussion

Twelve models have been deployed to achieve the objective of predicting the glucose concentration of patients with type 1 diabetes. These models may be grouped by their PH: 5, 15, 30 and 45 min.

The predictors with a PH of 5 min are the most accurate models, but they do not provide enough time to take the therapeutic actions needed to prevent adverse events (hypoglycemia or hyperglycemia), owing to the delayed action of insulin infusions [15]. Consequently, the predictors with a PH of 5 min are not useful in a clinical scenario.

Predictors specialized to forecast at 15 min have shown high accuracy; they can be used to improve glucose control in artificial pancreas systems, which are designed to make automatic, real-time adjustments of insulin delivery with no human intervention [48]. The models with PHs of 30 min or longer give a patient enough time to make the necessary adjustments in insulin delivery and consequently prevent hypoglycemia or hyperglycemia events. Both the predictors with a PH of 30 min and those with a PH of 45 min provide a sufficient time interval to anticipate the patient's action, making both useful for predicting the glucose concentration. However, the predictors with a PH of 30 min offer the best compromise between the time available for therapeutic adjustments and the accuracy of the predictions.

In Sect. 4.2, the performance of predictors with the same PH and different IDs was compared, assessing the effect of the ID on the predictor's output. It is important to consider all the characteristics of a predictor before choosing one: its ID, its PH and its results in the metrics evaluation. Taking all this into account, the results in Table 4 are very similar for the predictors at the same PH. However, the predictors with larger IDs have a longer initialization time. Therefore, with all the models achieving similar accuracy, it is advisable to choose the models with smaller IDs, because they start to predict sooner.

Nevertheless, the most accurate predictors were those with an ID of 50, which corresponds to a period of 250 min. This happens because they are the models that receive the most information. However, the results of the other models are similar: the improvement from adding more information at the input is small. Several possible reasons for this effect are discussed hereafter.

First, patients with type 1 diabetes do not have endogenous insulin secretion; hence, their metabolism lacks the mechanisms that regulate the glucose concentration. The LSTM neural networks are fed with past glucose concentration values, and the predictor may be giving priority to the most recent values because they have a bigger impact on the future glucose values. Hence, the most important characteristic taken into account by the predictors is the glucose trend, so they tend to ignore the oldest values, which have a lower impact on the trend of the signal.

The mechanisms that patients have to change the glucose concentration are mainly the injection of insulin, which lowers the glucose value, and the consumption of carbohydrates, which raises it. However, the effects of insulin and carbohydrate intake are not immediate: insulin starts to act about 1 h after its injection, and the absorption of the consumed carbohydrates takes between 30 min and 2 h depending on their type. These are the long-term dependencies that the predictor should learn in order to improve the prediction.

The problem is that insulin boluses and food intakes appear only a few times in each profile, 3 or 4 times on average, corresponding to the main meals. Given that the training dataset has 36 daily profiles, the number of available main meals is around one hundred. This might not be a sufficient number of training samples to learn the effects of possible long-term dependencies on future glucose values; as a result, the oldest input values are partially ignored.

In addition, an LSTM layer has roughly four times as many weights and biases as a comparable multilayer perceptron layer, one set per gate. This translates into LSTMs needing larger datasets for training, so the proposed predictors could benefit from enlarging the number of profiles.

Moreover, longer-term dependencies in the glucose concentration might exist. Future research with profiles longer than 1 day would be needed to assess the impact of these long-term dependencies.

Nevertheless, the twelve models presented are between 17% and 62% more accurate than those obtained in [15]. In addition, the twelve models are slower on the upward trends than the models presented in [15], while being faster on the downward trends. This is in fact a desirable characteristic, because diabetic patients are usually more worried about hypoglycemia events than hyperglycemia events.

5 Conclusions

The aim of this paper was to continue the work presented in previous studies [15] with the design of LSTM-based predictors. The results obtained are encouraging and support the use of LSTMs for the prediction of glucose concentration in patients with type 1 diabetes.

The predictors were evaluated at four different PHs (5, 15, 30 and 45 min) and fed with three different IDs (5, 20 and 50). From this architecture, twelve predictors were deployed, each one specialized in one PH with a specific ID.

Results show that the predictors with a PH of 30 min provide the best compromise between the time a patient has to modify the treatment and the performance of the model predictions. In addition, an ID of 5 is recommended because it has a much shorter initialization time. Nonetheless, the rest of the models could also be used depending on the prediction needs to be covered.

The results presented in this paper open new lines of future work. The most important aspect is the number of patients and the amount of data. Several reasons have been proposed to explain why the different IDs give such similar results; the most probable cause is the need for more data to train the LSTM-based predictors. Nonetheless, a complementary approach based on the paradigm of few-shot learning, in which models are trained with small amounts of data, is also recommended (see [49, 50]).

Moreover, in this paper only three parameters are used to predict the future values of glucose. However, the process that regulates the glucose concentration is a complex mechanism, and other parameters, such as physical activity, also have an impact on it. Therefore, a future research line might be to feed the predictors with new parameters and assess their importance for the performance of the predictor. Finally, an automatic approach to correct the sensor artifacts and an automatic optimization of all ANN parameters would be desirable.