Keywords

1 Introduction

According to the Global Disease Burden (GBD) study, high blood pressure (BP) (i. e. hypertension) is the risk factor that leads to more deaths worldwide [16]. The standard way of monitoring this condition is through the measurement of BP using an uncomfortable cuff-based device [25]. Fortunately, comfortable and common wearables can already detect changes in the flow of blood through a photoplethysmography (PPG) sensor [1]. The PPG signal (photoplethysmogram) obtained from it is already used with success to estimate heart rate [19] and, has the potential to go beyond that into accurate BP prediction [2, 5].

Most of the work in this area focus on building predictive models for patients in intensive care units (ICUs) [12, 20, 24]. However, data collected from regular life contain motion artefacts that are not observed in intensive care. Additionally, models that work on healthy populations [17, 18] should also be validated on hypertensive populations for guarantying their applicability in BP monitoring. Hence, in our work we focused on assembling a dataset containing data from subjects with hypertension (HYPE) during a stress test and 24-h monitoring.

We then evaluated machine learning (ML) models for predicting BP from PPG in the HYPE dataset and also in a dataset from healthy subjects during a stress test (EVAL). From the PPG signals, we extracted features from the time domain plus their image representations. Errors as low as the ones in the literature—for patients in the ICU or healthy subjects—could not be reproduced, even after processing the PPG signals with diverse time windows and filters.

This work is detailed as follows: Sect. 2 shows previous work in the field and Sect. 3 describes the datasets and methods we used to predict BP from the PPG signal. In Sect. 4 we convey our findings and results, followed by a discussion in Sect. 5 and Sect. 6 describing the implications of this work.

2 Related Work

Existing work focuses on predictive models using MIMIC [7], a dataset that contains physiological signals including PPG and ambulatory BP (ABP) from patients in ICUs. Kurylayak et al. [12] and Wong et al. [24] have both applied artificial neural networks (ANN) to predicted BP in this dataset and reported success. However, they used unknown or small sample sizes as can be seen in Table 1. Moreover, Kurylayak et al. only extracted time domain features from the PPG signal while Wong et al. also extracted frequency domain ones. Conversely, Slapničar et al. [20] tried a spectro-temporal ResNet with all features in a larger sample size but could not report the same success as his predecessors.

Others have tried to collect data from healthy subjects in daily life such as Lustrek et al. [17]. They have used the device empatica E4Footnote 1 and evaluated a range of machine learning (ML) techniques, achieving the best results with an ensemble of regression trees and the leave-one-subject-out (LOSO) validation strategy. However, they had to use ground truth BP from each subject to personalize the algorithm. Lastly, there is the work of Manamperi et al. [18], in which they evaluated ANN in MIMIC and in a set with data from voluntary subjects (assumed as healthy). They claim to have done the second evaluation in a non-clinical scenario, but the subjects were mainly at rest in their experiment.

Therefore, the current state-of-the-art does not give yet conclusions about the use of PPG to predict BP in diverse populations and in daily life. There is a clear need for more comparative studies both with healthy and hypertensive subjects and in different scenarios, especially outside of controlled conditions.

Table 1. Blood Pressure Prediction from Photoplethysmograms

3 Methods

3.1 Datasets

In our work we used two datasets: one created by us with data from a hypertensive population (HYPE) and one that is openly available containing data from a healthy population (EVAL). It should be noted that both datasets recorded patients during a stress test and HYPE also during 24-h monitoring. We describe the two datasets below.

  • HYPE. This dataset was created by us as part of the CardioVeg study (NCT03901183) approved by the Ethics Committee from Charité, Berlin (no. EA4/025/19). Data was collected from 12 subjects (6 female) in the age range of 31–75 (median 60) that had hypertension. The study collected data from a (1) stress test and from (2) 24-h monitoring using the empatica E4 wristband as the PPG source and the Spacelabs (SL 90217) BP monitor. (1) Stress Test. The subjects followed a protocol in which they watched a relaxing video for 5 min then had their BP taken by a physician five times with an interval of 1 min per measurement [25]. Then, the patients biked in an ergonomic bike from 5 to 10 min and relaxed again. During the second relaxation phase their BP was measured again 5 times with a 1 min interval. This dataset contains a total of 95 BP recordings. One subject could not bike due to extreme high BP and another one had a failure in the wearable device. Therefore, this experiment had 10 subjects (5 female). (2) 24 H. In this phase, the same subjects from the stress test were monitored for 24-h during regular day activities. The Spacelabs monitoring device was configured to measure BP every 30 min during the day and every hour during the night. This dataset contains a total of 464 BP recordings and all 12 subjects were measured.

  • EVAL. This dataset was generated by Esmaili et al. [3]. The original paper tried to estimate BP based on pulse transit time (PTT) and pulse arrival time (PAT). Both variables are derived from the differences between the PPG and ECG signals. This data was collected from 26 healthy subjects in the age range of 21–50 years. The subjects were required to run for 3 min at the speed of 8 km/h to induce perturbations in their BP values. Directly after the exercise the subjects were made to sit upright and BP values were measured along with PPG and ECG. A force-sensing resistor (FSR) was used under the BP monitor cuff to measure the instantaneous cuff pressure. With the FSR it was possible to pin point the exact time when the SBP and DBP were measured. A total of 152 BP values were recorded in this dataset.

3.2 Handcrafted Feature Extraction Methods

Our first approach entailed extracting handcrafted features from the PPG signal. Time windows of 15, 30 and 45 s around the BP measurement were used for our experiments. To eliminate motion artefacts induced by wrist movements sections in which the Euclidean norm of x-, y- and z-acceleration lied outside of an interval of 25% of the standard deviation around the sample mean, for the current window, were removed from consideration. The motion removal was only done for the HYPE dataset as the EVAL dataset did not contain any motion signals corresponding to the PPG recording. We also experimented with signal normalization and filters such as Chebyshev II and Butterworth, since they were reported as the best filters for PPG signals [15]. For the processed signal, the PPG cycles were then identified with a standard peak detection function.

All detected cycles in the same window were combined into a custom PPG signal template (details in Sect. B), following a procedure described by Li and Clifford [13]. Individual cycles were then compared with the template using two signal quality indices (SQI): (1) direct linear correlation and (2) direct linear correlation between the cycle, re-sampled to match the template length, and the template itself. Only if both correlations lied above 0.8, the cycle was further processed to extract features. This resulted in some BP intervals not having any features extracted since no cycles matching the template were identified.

After the clean PPG cycles have been identified, time domain features were extracted and the detailed list can be found in the Appendix (Table 4). The first step was to identify the first peak in the cycle, which corresponded to the systolic peak. Then for various percentages of the peak amplitude, we extracted the time between systolic peak and end of the cycle (\(DW_n\)), start of the cycle and end of the cycle (\(SW_n+DW_n\)), and the ratio between the time in the cycle before and after the systolic peak (\(DW_n/SW_n\)). For every window, the mean and variance of each feature were computed and used as input for the models.

3.3 Image Representation Methods

An alternative approach to the manual feature extraction has recently gained much popularity involving convolutional neural networks (CNNs). The approach is to represent the waves as images and then use a transfer learning method based on pretrained CNNs to learn embedding from the images and use them to predict BP. The two different image-form representations of PPG signals that we tested were spectrograms and scalograms, described below.

  • Scalograms. A scalogram is usually plotted as a graph of time and frequency and it represents the absolute value of the Continuous Wavelet Transform (CWT) coefficients of a signal. The scalogram-CNN based approach was first discussed in Liang et al. [14]. However, it was only evaluated for hypertension stratification, not BP prediction. Before passing the signal to the CWT, we detrended it, i.e. subtracted the mean value from the input signal. CWT is a convolution of the input data sequence with a set of functions generated by the base wavelet. We used the complex Morlet wavelet function as the base wavelet, which is given by:

    $$\begin{aligned} \varPsi (t) = \frac{1}{\sqrt{\pi B}}exp^{-\frac{t^2}{B}}exp^{2\pi Ct} \end{aligned}$$
    (1)

    The value of bandwidth frequency (B) and center frequency (C) was chosen to be 3 and 60 in the above equation, following the work of Liang et al. [14]. Compared to a spectrogram, a scalogram is usually better at identifying the low-frequency or fast-changing frequency component of the signal.

  • Spectrograms. A spectrogram displays changes in the frequencies in a signal over time. A third dimension indicating the amplitude of a particular frequency at a particular time is represented by the intensity or color of each point in the plot. The spectrogram-CNN approach to predict BP was first discussed in Slapničar et al. [20]. Similar to scalograms, we detrended our signal before generating the spectrogram plots. To generate a spectrogram, digitally sampled signals in the time domain are broken up into windows, which usually overlap, and they are Fourier transformed to calculate the magnitude of the frequency spectrum for each window [21].

Fig. 1.
figure 1

(a) displays a sample spectrogram and (b) a scalogram with the complex Morlet wavelet function used as the base wavelet for the same PPG signal.

Figure 1 depicts a sample spectrogram and scalogram generated from a PPG snippet of 15 s. The image representations of the signal were then fed into a ResNet architecture to learn the image embeddings [9]. The Residual Network or ResNet design enables us to train very deep neural networks without running into the vanishing gradient problem. Since our datasize is very small, instead of training a network from scratch, we decided to take a network which was already trained on the ImageNet dataset [8]. In particular, we used the ResNet18 architecture and took the embeddings from the penultimate layer of the network. We also experimented with Alexnet, but Resnet18 always performed marginally better [11]. This might be due to the fact that the penultimate layer of the Resnet18 generates a 512 length embedding, whereas the AlexNet generates a embedding of length 4096. The larger size of the input vector, in spite of using feature selection and strong regularization techniques, might make it challenging for the models to learn from, due to the small data size.

3.4 Machine Learning Models

Previous works show that machine learning algorithms perform well in predicting BP from features derived from PPG and/or ECG. We have employed in our experiments three popular machine learning algorithms: (a) Generalised Linear Models (GLM) with Elastic Net regularisation [26], (b) Gradient Boosting Machines (GBM) [4], and (c) a recent more efficient implementation of GBM called LightGBM (LGBM) [10] to predict the systolic and diastolic BP.

For prediction from the image embeddings, we used a Recursive Feature Elimination (RFE) technique with a support vector machine (SVM) with linear kernel as the base estimator, before pushing the vectors into the models.

3.5 Experimental Settings

In order to train models that are robust and well generalizable, we used a leave-2-subjects-out cross validation for all models, i.e. at every iteration we use data from 2 subjects as the test set, trained our models on the remaining data and repeated this procedure till all subjects have been at some point used as the test set. All hyper parameters were optimized empirically. We evaluated the models based on the mean absolute error (MAE). The MAE was calculated at each iteration and we calculated the mean and standard deviation of these values.

4 Results

In this section we report our experimental results. In Table 2 and Table 3 we show the comparison of the MAEs for predicting systolic blood pressure (SBP) and diastolic blood pressure (DBP) respectively, between all models in the different datasets. The cells in these table contain the mean and standard deviation (in parenthesis) of the MAE of all cross validation folds. Noticeably, in the HYPE dataset feature extraction methods consistently outperformed the image based methods. In EVAL the spectrogram-representation method outperformed the other two approaches. In both datasets, the best results for the spectrogram based approach are usually marginally better than the best results of the scalogram based approach. For the image based methods the more advanced machine learning models such as LGBM and GBM clearly outperformed the GLM model. This is most probably due to the comparatively large dimension of the input image embeddings. For the feature extraction based methods this difference is not so prominent, and in some cases the GLM turns out to be the best performing model. In general, based on the MAE values, predicting SBP appears to be more difficult than DBP which is consistent with previous literature (see Table 1).

5 Discussion

5.1 Clinical Relevance

Cuff-less and continuous methods of measuring BP are particularly attractive as BP is one of the most important predictors of long term cardiovascular health [6]. Prediction models for BP based on PPG signals can be a very important stride in that direction. But for reliable continuous monitoring of BP, these models need to perform well during regular day-to-day activities and also for different patient populations. Apart from MIMIC there does exist a few other studies that try to collect PPG and corresponding BP signals following a strict protocol (similar to HYPE: stress-test). To the best of our knowledge, our 24-h dataset is the first attempt to collect this data from an uncontrolled environment where the subjects were free to do anything. Our evaluations do underline that it is indeed more challenging to accurately predict BP in such an uncontrolled environment.

5.2 Technical Relevance

In this work we impartially evaluated different models and approaches for BP prediction from PPG. Most of the methods have only been previously validated on MIMIC. Also, due to the large volume of existing work, very often the models were not compared against all available approaches. To the best of our knowledge, this is also the first work to compare the scalogram, spectrogram and feature extraction approaches. We employ strong cross validation methods to make sure our results are robust. Our models and code are available openly to make sure this results can be reproduced and also applied to similar datasets when needed.

Table 2. MAE of the different models for SBP prediction in different datasets
Table 3. MAE of the different models for DBP prediction in different datasets

5.3 Limitations and Future Work

The major limitation of our work is related to the small size of the datasets we used. For that reason, it was not possible to train a deep Long Short Term Memory (LSTM) network, which in a few recent papers have demonstrated very promising results [23]. In future work, we would like to extend our dataset with more diverse patient populations and also with a longer observation period per patient. We might consider data augmentation techniques as well. This will allow us to apply more data-demanding learning algorithms and, at the same time, to investigate how models trained in one population perform in a different one.

6 Conclusion

In conclusion, we presented a comprehensive comparison of different machine learning approaches to predict BP from PPG in two different datasets. We demonstrate that despite the plethora of work in this area, there exists a dearth of models that perform well in uncontrolled environments when the subjects indulge in various day-to-day activities. To achieve a MAE (\(\le \)5 mmHg), which is considered good by the Association for the Advancement of Medical Instrumentation\(^{\textregistered }\) (AAMI) [22] we still have a long way to go. Moreover, we showed that for small to medium sized datasets feature extraction methods can produce better results than the recent image based approaches. We hope our work will inspire others to dig deeper into the generalizability and improve the accuracy of these models.