1 Introduction

Computer-aided diagnostic systems have emerged as powerful tools in the medical field, supporting the prediction and detection of various diseases. These systems have introduced a new layer of transparency and reliability to medical decision-making processes. The incorporation of Artificial Intelligence (AI) in healthcare, particularly within hospital settings, is transforming the medical landscape; improved diagnostic accuracy and better-informed medical advice are just two of the benefits reaped from this integration.

Mental stress can have a severe impact on a person's overall well-being and can lead to a variety of physical and emotional health problems (Harvard Health, 2020). Common symptoms of mental stress include difficulty sleeping, fatigue, irritability, difficulty concentrating, and changes in appetite. Mental stress is primarily a physiological reaction to external stimuli, mediated by the sympathetic nervous system. During this response, a variety of chemicals, including cortisol and adrenaline, are released; these increase the heart rate and breathing rate and tighten the muscles, preparing the body for a physical response (the "fight-or-flight" reaction) (Bracha et al., 2004). Chronic mental stress can also increase the risk of developing more serious health problems, such as heart disease, high blood pressure, and depression. It is therefore vital for individuals to find ways to manage their stress and maintain their mental health (Adarsh et al., 2023; Chrousos & Gold, 1992; McEwen & Stellar, 1993; Rosmond & Björntorp, 1998; Selye, 1976).

Heart rate variability, commonly referred to as HRV, is a metric that assesses fluctuations in the time interval between heartbeats. It serves as a valuable tool in determining an individual's physiological condition (He et al., 2019; Moridani et al., 2020; Oskooei et al., 2019). Higher HRV is generally associated with a greater ability to adapt to stress and healthier overall autonomic nervous system function, while lower HRV is associated with increased stress and poorer health. HRV can be measured using various techniques, including electrocardiography (ECG), photoplethysmography (PPG), and accelerometry. It is important to keep in mind that HRV features should be interpreted in the context of the person's overall physical and mental state; they should not be used alone to draw conclusions about stress levels.

Ultra-short HRV (US-HRV) (Salahuddin et al., 2007) refers to the measurement of HRV over very short periods, typically ranging from a few seconds to a few minutes. It is usually derived from continuous ECG or PPG recordings and can be used to assess an individual's physiological state in real time. The key difference between HRV and US-HRV is the measurement time frame: conventional HRV is computed over periods of several minutes to several hours, while US-HRV is computed over much shorter windows. HRV therefore provides a longer-term view of an individual's physiological state, while US-HRV gives a more immediate and dynamic view. US-HRV is also more sensitive to changes in an individual's physiological state, because it captures fluctuations that occur over very short periods and may be averaged out in longer-term HRV measurements.
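To make the notion of ultra-short windows concrete, the following is a minimal sketch (in Python, with illustrative function and variable names) of how common time-domain HRV features such as Mean RR could be computed from a short window of RR intervals; it is not the exact feature extractor used in this study.

```python
import numpy as np

def ultra_short_hrv_features(rr_ms: np.ndarray) -> dict:
    """Common time-domain HRV features from a short window of
    RR intervals given in milliseconds."""
    diffs = np.diff(rr_ms)  # successive RR-interval differences
    return {
        "mean_rr": rr_ms.mean(),                     # Mean RR
        "sdnn": rr_ms.std(ddof=1),                   # SD of RR intervals
        "rmssd": np.sqrt(np.mean(diffs ** 2)),       # RMS of successive differences
        "pnn50": 100 * np.mean(np.abs(diffs) > 50),  # % successive diffs > 50 ms
    }

# A hypothetical ultra-short window of RR intervals
rr = np.array([812, 845, 790, 828, 803, 861, 795, 840], dtype=float)
print(ultra_short_hrv_features(rr))
```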

Deep learning has seen widespread use in the analysis of image data through convolutional neural networks (CNNs). The success of CNNs in areas with common and regular domains, such as computer vision and speech recognition, has paved the way for an increased focus on Graph Neural Networks (GNNs). These networks are predominantly engaged in reinterpreting the concept of convolution for graph structures (Wu et al., 2021). Graph Convolutional Networks (GCNs) have witnessed a surge of applications across various fields in recent years, particularly in the healthcare sector, such as brain analysis (Li et al., 2021), mammography assessment (Du et al., 2019), and image segmentation (Soberanis-Mukul et al., 2020). Deep neural networks, despite their ability to model complex relationships between input and output variables, are often seen as "black box" models (Shao et al., 2021). This is due to their inherent complexity, which makes it challenging to understand the role a specific input feature plays in generating the output. This lack of interpretability is a significant obstacle, particularly in clinical applications where decision-making processes need to be explained to comply with regulatory requirements, such as the European Union's General Data Protection Regulation (GDPR). The GDPR demands that automated decision-making processes be explainable and that patients have the right to refuse automated decisions.

There has been a significant amount of research on using machine learning (ML) and deep learning (DL) algorithms to detect stress using ultra-short HRV (Ishaque et al., 2021; Kim et al., 2018; Lawanont et al., 2019; Pourmohammadi & Maleki, 2020; Rodríguez-Arce et al., 2020; Salahuddin et al., 2007; Sánchez-Reolid et al., 2020; Zalabarria et al., 2020; Zangróniz et al., 2018; Zubair & Yoon, 2020). However, as parameters and inputs grow, the resulting ML/DL models become more complex and larger in size, and new methods for trimming them down are essential if they are to be deployed on resource-constrained devices.

Model pruning (Abbasi-Asl & Yu, 2021; Dong et al., 2022) is a technique for reducing the size and complexity of a machine-learning model by removing unimportant or redundant parameters. Because several rounds of pruning are usually required to attain the desired amount of compression, the pruning process can be slow. Moreover, existing pruning methods either fail to account for topology changes while compressing the models, or rely on manually built rules or embeddings that neglect rich topological information.

1.1 Problem definition

To the best of our knowledge, most existing machine learning/deep learning-based clinical decision support systems suffer from a lack of interpretability. Furthermore, the high complexity and large size of the generated machine learning/deep learning models make these systems impractical to deploy on resource-constrained devices and embedded systems. By combining pruning and quantisation into a single process and using explainability as a guide, this study achieves a smaller model size while maintaining competitive performance and preserving the most important contributing features. This paper develops a methodology that can effectively address the challenge of deploying a stress detection method on resource-constrained devices while maintaining its effectiveness and achieving explainability. The study develops a novel approach for model compression and optimisation, with the goal of significantly reducing the size and complexity of the stress detection algorithm without compromising its accuracy and performance. The ultimate purpose is to enable the implementation of the stress detection method on wearable sensors and other resource-constrained devices, providing individuals with convenient access to real-time stress assessment and management tools. In this study, explainability is incorporated through the SHAP method as an aid to feature selection and network pruning. It is used to help explain the network produced by the graph convolutional network and to provide insights into the contributing features for stress detection using US-HRV. The Shapley values serve as a reference to ensure that the pruning process not only reduces the size and complexity of the model but also preserves the most effective contributing features for stress detection.

The salient contributions of this paper are as follows:

  1. It introduces an innovative pruning technique that utilizes graph convolutional networks to identify valuable contributing features and devise an effective compression strategy for stress detection using wearable sensors.

  2. Our proposed approach integrates both pruning and quantization processes, which leads to a reduced model size while still delivering competitive performance levels.

  3. We employ the SHapley Additive exPlanations (SHAP) method to assist in feature selection and network pruning, thereby understanding the subgraphs generated from the GCN for the required operations.

  4. Our method increases sparsity to approximately 60%, with a minimal drop in accuracy (less than 1%), which further illustrates the efficiency of the proposed model.

The remaining parts of the paper are structured as follows. In Sect. 2, we discuss the current state of the art in stress diagnosis. Section 3 presents our proposed approach of an explainable GCN combining pruning and quantisation for mental stress detection. The experimental setup and analysis of the results are covered in Sect. 4, followed by conclusions in Sect. 5.

2 Related works

2.1 Machine learning and XAI for healthcare

Explainable Artificial Intelligence (XAI) methods have been developed to make the classification decisions of complex machine learning models interpretable. These methods typically follow one of two approaches: functional or message passing. Functional methods focus on localised prediction analysis and include techniques such as sensitivity analysis, Taylor series expansion, and model-agnostic approaches like LIME and SHAP. Conversely, message-passing methods generate explanations by running a backward pass through the computational graph that produced the prediction. Initial steps towards enhancing the interpretability and explainability of machine learning models used in clinical applications were taken by Dave et al. (2020), Holzinger et al. (2017), and Tjoa and Guan (2021). The main objective is to make these models more transparent and understandable for both ML engineers and medical practitioners. Wang et al. (2021) developed a model combined with XGBoost, yielding a significant improvement in diagnosing anterior mediastinal masses (MCs and MTs) with a 97.2% accuracy rate. For smaller lesions, the model outperformed radiologists by achieving an accuracy rate of 83.5%. The study underlines the challenges related to the interpretability of complex radiomics models. ElShawi et al. (2021) proposed four measures for evaluating interpretability techniques in machine learning. The study compared six popular techniques, LIME, SHAP, Anchors, LORE, ILIME, and MAPLE, on real-world healthcare data. Results showed variations in performance across metrics and data types, highlighting the need for specifying the interpretability focus and understanding the strengths and weaknesses of each technique. Pai et al. (2021) developed a predictive model for identifying ICU patients with bloodstream infections using five machine-learning algorithms on 30 clinical variables. The XGBoost and random forest models performed well, with key predictors being alkaline phosphatase and central venous catheter period. Further validation through clinical trials is recommended. Knapič et al. (2021) examined the potential of XAI methods for decision support in medical image analysis, focusing on in vivo gastric images from video capsule endoscopy. The study found limitations in evaluating the effectiveness of explanations with non-medical users and suggested further evaluation with domain experts. Alorf (2021) explored the feasibility of CNNs for distinguishing COVID-19 infections from other pulmonary conditions in radiography images. The CNNs showed high sensitivity and specificity, but further training and testing with diverse image sources are needed for practical implementation.

Müller et al. (2022) evaluated the application of XAI techniques in the context of in vitro diagnostic (IVD) devices, introducing the concept of 'causability' as an evaluation of usability in assessing XAI explanation quality. The study underscored the potential value of XAI in glaucoma diagnosis through image analysis. Sarp et al. (2023) developed an XAI model for detecting and interpreting COVID-19 positive Chest X-Ray (CXR) images using transfer learning and data augmentation. Das et al. (2023) addressed interpretability and dimensionality in heart disease classification using XAI and SHAP with four models. XGBoost showed a 2% increase in accuracy over existing methods, marking the first attempt to explain XGBoost's heart disease diagnosis using these techniques. Pattepu et al. (2023) presented a novel paradigm in non-terrestrial networks (NTN) using the XAI approach, optimizing the relationship between signal-to-noise ratio and neighbour nodes, as demonstrated through mathematical formulations and simulations for smart healthcare. Gaube et al. (2023) found that providing explanations with predictions improved physicians' diagnostic accuracy and quality rating, particularly for non-task experts. Future studies could explore the impacts of further complexity and differing explanations. Bienefeld et al. (2023) explored the differing views of developers and clinicians on XAI in healthcare, underscoring the necessity of incorporating both developer and clinician perspectives when designing XAI systems. A summary of the existing research works on machine learning and XAI for healthcare (with their limitations) is presented in Table 1.

Table 1 Summary of recent studies with ML and XAI in healthcare

2.2 Stress and XAI

In the past few years, numerous studies have been conducted with the goal of detecting stress through the measurement of physiological markers. These included circumstances in which the participants were required to deliver a speech in front of an audience, perform mental computations, or endure uncomfortable physiological conditions (Gjoreski et al., 2016; Hovsepian et al., 2015; Picard et al., 2001). HRV analysis of electrocardiogram data has been used in a significant amount of stress-analysis research. An electrocardiogram (ECG) may be used to assess a person's heart rate variability (HRV), which can then be used to determine how stressed that person is (Ramteke & Thool, 2017; Rigas et al., 2012; Tanev et al., 2014). Based on HRV, Delaney and Brodie (2000) explored how the heart responds to short-term psychological stress. An HRV feature-based transformation strategy was applied by Wang et al. (2013) to the Physionet driver database with a K-nearest neighbour (KNN) classifier to identify stress.

Traditional machine learning methods, such as Random Forest, were used to solve a three-class problem (no stress, medium stress, and severe stress) and achieved an accuracy of 72% (Gjoreski et al., 2016). Schmidt et al. (2018) trained a stress classification model with a precision of 92.28% using 67 features derived from 7 sensor modalities. Using the same dataset, Bobade and Vani (2020) employed Deep Neural Networks (DNN) and 40 statistical features to achieve a 95.21% accuracy rate. Aqajari et al. (2020) trained a stress classification model using EDA, obtaining an accuracy of 92% by combining statistical features with a representation learned by a deep learning model. Motivated by the success of the XGBoost algorithm, Hsieh et al. (2019) trained on EDA data using features derived in the time, entropy, frequency, and wavelet domains.

Ham et al. (2017) extracted HRV features and used LDA to identify and classify exact stress levels, achieving a high degree of accuracy in classifying people into three groups: no stress, mild stress, and highly stressed. Zangróniz et al. (2018) introduced a method that classifies mental distress with a tree-based classifier, revealing an underlying complementarity that raises the discriminating model's accuracy to 82.35%. Lawanont et al. (2019) proposed a system that uses an IoT architecture to build the stress recognition model, achieving an accuracy of 81.70% with a DT. Zubair and Yoon (2020) used different classifiers based on quadratic discriminant analysis (QDA) and the Support Vector Machine (SVM) and were able to identify five levels of mental stress with an accuracy of 94.33%. Moridani et al. (2020) showed that HRV features can be used to differentiate between stress and non-stress stages using a convolutional neural network (CNN), obtaining an average classification rate of 97.9% for cognitive stress and 94.5% for emotional stress. Pourmohammadi and Maleki (2020) used an innovative combination of feature selection with SVM that yielded stress-identification accuracies of 100%, 97.6%, and 96.2% across two, three, and four levels, respectively. Rodríguez-Arce et al. (2020) used KNN to measure students' anxiety levels with the State-Trait Anxiety Inventory (STAI) and found that the physiological feature subset best explains the difference between stress and anxiety states. Sánchez-Reolid et al. (2020) had 147 participants watch a series of video clips depicting tense and relaxed situations, intended to evoke specific emotions; their approach achieved an F1-score of 83% with SVM and 92% with D-SVM. Zalabarria et al. (2020) classified stressed and relaxed states by applying a 20-s sliding-window protocol to a fuzzy algorithm, which yielded F1 scores of 91.15% and 96.61% for the stressed and relaxed states, respectively. Zainudin et al. (2021) employed an IoT sensor to gather data from a real-life mental health scenario and obtained the best classification accuracy of 96% with a DT. Deep learning has also been used to address ECG-based stress detection (Seo et al., 2019). A driver stress detection network was proposed by Rastgoo et al. (2019), which utilised a multi-modal fusion of CNN and LSTM, achieving an accuracy of 92.8%. Uddin et al. (2022) employed an ANN to predict depressive symptoms in a large textual sample based on people's online behaviour, yielding an accuracy of 95%. Table 2 summarises recent studies in stress classification with various ML/DL algorithms and their limitations. Most of these stress detection models have shown good performance in controlled settings; however, they may not perform well in real-world scenarios (Ham et al., 2017; Lawanont et al., 2019; Rodríguez-Arce et al., 2020). They also lack interpretability, making it difficult to understand the reasoning behind a model's predictions and to identify errors in the model.

Table 2 Summary of recent studies on stress classification

2.3 Compression (pruning and quantization) of deep neural networks

Wearable technologies sit at the outermost boundaries of a network and frequently interact directly with users or the physical environment. For artificial intelligence models to function in real time on such devices, it is imperative to optimise them for reduced latency, minimal power consumption, and constrained storage capacities. Machine learning models, and deep neural networks in particular, typically exhibit parameter counts ranging from millions to billions. This level of intricacy frequently contributes to enhanced precision, yet it also leads to larger model sizes and prolonged inference durations. The computational cost and energy consumption associated with large models on edge devices can pose significant barriers. The main goal of pruning is to reduce the resource demands of the model while maintaining its performance at a satisfactory level, resulting in a lighter, faster, and more efficient model.

Pruning techniques have the potential to alleviate these issues. Pruning entails the elimination of less significant parameters or neurons from a model, reducing its overall complexity. Various techniques can be employed, ranging from basic weight pruning, which removes weights below a specific threshold, to more advanced approaches such as L1 or L2 regularisation, which promotes sparsity in the model's weights during training. The advantage of pruning is its ability to substantially reduce the size of the model, thereby enabling its compatibility with devices that possess restricted storage capacity. The decrease in model size can also reduce inference time, facilitating faster real-time responses, which is critically important for numerous applications on edge devices. For example, a pruned model can enable faster object recognition on a smartphone camera or more efficient anomaly detection in an Internet of Things (IoT) sensor network.

Additionally, the process of pruning has the potential to decrease the energy demands of the model. Energy efficiency is of utmost importance, especially in the context of battery-operated devices. A pruned model necessitates a reduced number of computations, thereby resulting in decreased energy consumption.

There have been several proposed techniques for compressing and speeding up neural networks. Tensor factorisation decomposes the weights of a neural network into smaller, more manageable components; the decomposition of a 3 × 3 convolutional filter into a 1 × 3 filter followed by a 3 × 1 filter was demonstrated by Jaderberg et al. (2014). Previous studies have employed truncated singular value decomposition (SVD) to accelerate fully connected layers (Denton et al., 2014; Girshick, 2015; Xue et al., 2013). Quantisation (Rastegari et al., 2016) offers an alternative strategy for mitigating computational complexity: floating-point values are represented using a reduced number of bits, conserving resources while maintaining an acceptable level of precision. Zhang et al. (2018) proposed a compact network design that alters the convolutional structure.

Pruning techniques, on the other hand, primarily centre on reducing network complexity through the elimination of connections. Han et al. (2015) proposed an iterative strategy for constructing a sparse network by eliminating connections whose weights fall below a predetermined threshold. However, the resulting unstructured sparsity frequently faces practical performance challenges related to cache and memory access. To tackle this issue, several studies (Fernandes & Yen, 2021; He, 2022; Hu et al., 2016; Liang et al., 2021) have suggested removing redundant connections at the filter level.

Dong et al. (2022) used pruning to compress a deep neural network (DNN) model and found that it significantly reduced model size without a significant loss in performance. Abbasi-Asl and Yu (2021) used pruning to compress a convolutional neural network, achieving a significant reduction in model size and improved classification accuracy. Recently, many pruning strategies for automatically compressing DNNs have been presented (Blalock et al., 2020; He et al., 2017; Pasandi et al., 2020). However, they either rely on manually created rules or embeddings that neglect rich topological information, or they fail to take topology changes into account when compressing the models. A pruning strategy designed for one DNN cannot be transferred to another, which is why every network requires a strategy specifically tailored to it. Table 3 summarizes various pruning methods used in deep neural networks.

Table 3 Summary of various neural network compression methods

Several studies have used machine learning and deep learning algorithms to classify stress levels based on HRV features, with high accuracy achieved in controlled settings. However, the lack of interpretability and uncertain performance in real-world scenarios remain limitations. Recent research has focused on compressing deep neural networks through pruning strategies, achieving significant reductions in model size without significant losses in performance. However, current pruning strategies have limitations and require tailoring to specific networks. A notable point of divergence in our study lies in its capacity to autonomously determine the network architecture, specifically the optimal number of preserved channels at each layer.

This paper addresses the gap in the field of stress detection by proposing a novel pruning method based on graph neural networks. The method combines pruning and quantisation into a single process to achieve a smaller model size while maintaining competitive performance. This is achieved by using the SHAP method to aid feature selection and network pruning, reaching approximately 60% sparsity with minimal loss in accuracy. The proposed method helps to better identify effective contributing features and establish an effective compression strategy, making it an innovative approach to stress detection using US-HRV data.

3 Methodology

The methodology used in the study is outlined in Fig. 1. After preprocessing the data, SHAP was used to identify the major contributing features in the dataset by submitting it to a generic classification using a Graph Convolutional Network. Once the key features were identified and ranked, a two-stage model compression method was applied, consisting of model pruning followed by weight quantisation. The resulting model, with fewer parameters and lower computational complexity, was then used for classification, dividing the data into two categories: samples with indications of stress and samples without.

Fig. 1
figure 1

Proposed methodology

3.1 Feature identification using SHAP

SHAP is a method for interpreting the output of machine learning models (Lundberg et al., 2020). It is based on the concept of Shapley values from cooperative game theory, which provides a way to fairly distribute a value among a group of individuals based on their contributions. In the context of machine learning, SHAP values can be used to explain the contribution of each feature to the model's output. SHAP values can provide a way to understand which features are most important in a model's predictions and how different features interact with each other to affect the prediction.

In our work, the filtered ECG signal was used to calculate several statistical measures, including Mean RR and standard deviation. The absolute SHAP values of these features were used as their contribution scores, and absolute power was used to determine the peak frequency along each axis. The SHAP method thus quantifies the importance and contribution of each feature (in this case, the statistical measures of the filtered ECG signal, such as Mean RR and standard deviation) to the model's output. Absolute values are used to capture the magnitude of each feature's contribution rather than its direction.
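As a concrete illustration of this step, the snippet below shows how absolute SHAP values could be aggregated into a global feature ranking over HRV features. It is a hedged sketch: the classifier, feature names, and data are placeholders, and the model-agnostic KernelExplainer is used here rather than the exact explainer configuration of this study.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

feature_names = ["mean_rr", "sdnn", "rmssd", "pnn50", "lf", "hf", "lf_hf"]
X = np.random.rand(200, len(feature_names))   # placeholder HRV feature matrix
y = np.random.randint(0, 2, 200)              # placeholder stress labels
model = RandomForestClassifier().fit(X, y)    # stand-in for the trained classifier

# Model-agnostic SHAP against a small background sample
background = shap.sample(X, 50)
explainer = shap.KernelExplainer(model.predict_proba, background)
sv = explainer.shap_values(X[:20])
# Older SHAP releases return one array per class; newer ones a single array
sv = sv[1] if isinstance(sv, list) else sv[..., 1]

# Mean absolute SHAP value per feature = magnitude of contribution
importance = np.abs(sv).mean(axis=0)
for name, imp in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(f"{name}: {imp:.4f}")
```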

3.2 Classification using graph convolutional network

A graph convolutional network, or GCN, is a type of neural network that operates on graphs. A GCN takes a graph \(G = \left( {V,E} \right)\) as input, where \(V\) represents the set of nodes with \(\left| V \right| = n\) and \(E\) represents the set of edges. In addition to the adjacency matrix \({\mathbf{A}}\), which represents the structure of the graph, a matrix \({\mathbf{X}}\) is provided as input. This matrix stores the feature descriptions of the nodes: each node \(v_{i}\) is described by a vector \({\mathbf{x}}_{i} \in {\mathbb{R}}^{f}\), where \(f\) is the number of input features.

Each layer operates on a feature matrix in which each row is a node's feature vector. Using the propagation rule \(f\), these features are aggregated at each layer to produce the features of the next layer. GCNs can be trained end to end, in a supervised or unsupervised manner depending on the task at hand. They determine the new embedding state by utilising the structure of the graph together with the characteristics of the nodes and edges, iteratively aggregating information from neighbouring nodes. Once all the information has been combined, the final embedding state can be used for prediction. We use a GCN for classification as it is well suited to signal-processing tasks.
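For concreteness, a minimal PyTorch sketch of one graph convolution layer is shown below. It implements the standard propagation rule \(H' = \sigma (\hat{A}HW)\) with a symmetrically normalised adjacency matrix; the layer sizes and inputs are illustrative, not the exact architecture of this study.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(Â H W), where
    Â = D^{-1/2}(A + I)D^{-1/2} adds self-loops and normalises."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        A_hat = A + torch.eye(A.size(0))          # add self-loops
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)   # D^{-1/2} as a vector
        A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
        return torch.relu(A_norm @ self.linear(X))  # aggregate, then transform

# n = 4 nodes, f = 7 input features (e.g. the US-HRV measures of Table 4)
A = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0],
                  [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float32)
X = torch.rand(4, 7)
print(GCNLayer(7, 16)(X, A).shape)  # torch.Size([4, 16])
```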

3.3 Network pruning

Graph convolutional networks are a type of deep learning model designed to operate on graph-structured data. As the size of the graphs used in these applications increases, so does the size of the GCN model required to process them. This can lead to issues with computational complexity, memory usage, and model interpretability. One approach to addressing these issues is to use model compression techniques such as pruning. Pruning removes the redundant and insignificant parameters of a neural network model to make it smaller and more efficient, improving computational efficiency without significantly affecting accuracy. The bulk of the trainable parameters, including weights and biases, are stored in the convolutional layers of the graph convolutional network, which are responsible for the learning process; weight multiplications dominate the computational cost, whereas adding a bias contributes only a single operation per neuron. Pruning consists of three steps: training a large model, removing weights, and fine-tuning the remaining weights (see Fig. 2).

Fig. 2
figure 2

GCN pruning

In the context of GCNs, pruning typically involves removing connections between nodes in the graph, as well as removing entire nodes and their associated weights. One common approach is to use magnitude-based pruning, which involves setting a threshold value for the weights and removing those that fall below it. This can be done iteratively, with the model being retrained after each round of pruning until the desired level of compression is achieved. Pruning has been shown to be effective in reducing the size and computational complexity of GCN models while maintaining or even improving their performance.
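The following is a small sketch of magnitude-based pruning as just described, assuming a PyTorch weight tensor; the median is used here as a percentile-style threshold so that roughly half the weights are removed in one round.

```python
import torch

def magnitude_prune(weight: torch.Tensor, threshold: float) -> torch.Tensor:
    """Zero all weights whose magnitude falls below `threshold`;
    return the binary mask that was applied."""
    mask = (weight.abs() >= threshold).float()
    weight.data.mul_(mask)  # sever the corresponding connections in place
    return mask

w = torch.randn(16, 16)
mask = magnitude_prune(w, w.abs().median().item())  # prune the bottom ~50%
print(f"sparsity: {100 * (1 - mask.mean().item()):.1f}%")
```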

To guide the choice of pruning configuration, the performance of a candidate configuration is modelled as a Gaussian process. Let

$$\begin{aligned} & f \sim {\mathcal{G}\mathcal{P}}\left( {\mu \left( \cdot \right),k\left( { \cdot , \cdot } \right)} \right) \\ & \mu \left( \theta \right) = {\mathbb{E}}\left[ {f\left( \theta \right)} \right] \\ & k\left( {\theta ,\theta^{\prime}} \right) = {\mathbb{E}}\left[ {\left( {f\left( \theta \right) - \mu \left( \theta \right)} \right)\left( {f\left( {\theta^{\prime}} \right) - \mu \left( {\theta^{\prime}} \right)} \right)} \right]. \\ \end{aligned}$$

where

  • \(f\) represents a function, and \(f \sim {\mathcal{G}\mathcal{P}}\left( {\mu \left( \cdot \right),k\left( { \cdot , \cdot } \right)} \right)\) indicates that \(f\) is modelled as a Gaussian process with a mean function \(\mu \left( \cdot \right)\) and a covariance function \(k\left( { \cdot , \cdot } \right)\).

  • \(\mu \left( \theta \right)\) represents the mean function of the Gaussian process at input \(\theta\).

  • \({\mathbb{E}}\left[ {f\left( \theta \right)} \right]\) denotes the expected value of the function \(f\) at input \(\theta\) according to the Gaussian process.

Given \({{\varvec{\Theta}}} = \left\{ {\theta_{1} ,\theta_{2} , \ldots ,\theta_{n} } \right\}\) and function evaluations \(f\left( {{\varvec{\Theta}}} \right) = \left\{ {f\left( {\theta_{1} } \right),f\left( {\theta_{2} } \right), \ldots ,f\left( {\theta_{n} } \right)} \right\}\), the posterior belief of \(f\) at a novel candidate \(\hat{\theta }\) is given by

$$\begin{aligned} & \tilde{f}\left( {\hat{\theta }} \right) \sim {\mathcal{N}}\left( {\tilde{\mu }_{f} \left( {\hat{\theta }} \right),{\tilde{\Sigma }}_{f}^{2} \left( {\hat{\theta }} \right)} \right) \\ & \tilde{\mu }_{f} \left( {\hat{\theta }} \right) = \mu \left( {\hat{\theta }} \right) + k\left( {\hat{\theta },{{\varvec{\Theta}}}} \right)k({{\varvec{\Theta}}},{{\varvec{\Theta}}})^{ - 1} \left( {f\left( {{\varvec{\Theta}}} \right) - \mu \left( {{\varvec{\Theta}}} \right)} \right) \\ & {\tilde{\Sigma }}_{f}^{2} \left( {\hat{\theta }} \right) = k\left( {\hat{\theta },\hat{\theta }} \right) - k\left( {\hat{\theta },{{\varvec{\Theta}}}} \right)k({{\varvec{\Theta}}},{{\varvec{\Theta}}})^{ - 1} k\left( {{{\varvec{\Theta}}},\hat{\theta }} \right). \\ \end{aligned}$$

where

  • \(\tilde{f}\left( {\hat{\theta }} \right)\) is the function value at the novel candidate input \(\hat{\theta }\).

  • \(\mu \left( {\hat{\theta }} \right)\) is the mean of the Gaussian process at the novel candidate input \(\hat{\theta }\).

  • \(\tilde{\mu }_{f} \left( {\hat{\theta }} \right)\) is the mean of the Gaussian process at the novel candidate input \(\hat{\theta }\) after considering the known function evaluations.

  • \({\tilde{\Sigma }}_{f}^{2} \left( {\hat{\theta }} \right)\) represents the variance of the Gaussian process at the novel candidate input \(\hat{\theta }\).

  • \(k\left( {\hat{\theta },{{ \Theta }}} \right)\) and \(k\left( {{\Theta },\hat{\theta }} \right)\) are covariance vectors representing the covariances between the novel candidate input \(\hat{\theta }\) and the existing inputs in the set \({\Theta }\).

Let \(\theta^{ + }\) be the best candidate evaluated so far. The expected improvement (\({\text{EI}}\)) of a candidate \(\hat{\theta }\) is defined as the expected increase in the function value over the best candidate evaluated so far, \(\theta^{ + }\). The \({\text{EI}}\) can be computed efficiently in closed form and is used as the criterion for choosing the next candidate for evaluation; a numerical sketch is given after the definitions below.

$${\text{EI}}\left( {\hat{\theta }} \right) = {\mathbb{E}}\left[ {\max \left\{ {0,\tilde{f}\left( {\hat{\theta }} \right) - f\left( {\theta^{ + } } \right)} \right\}} \right]$$
$$\begin{aligned} & {\text{EI}}\left( {\hat{\theta }} \right) = {\tilde{\Sigma }}_{f} \left( {\hat{\theta }} \right)\left( {Z{\Phi }\left( Z \right) + \phi \left( Z \right)} \right) \\ & Z = \frac{{\tilde{\mu }_{f} \left( {\hat{\theta }} \right) - f\left( {\theta^{ + } } \right)}}{{{\tilde{\Sigma }}_{f} \left( {\hat{\theta }} \right)}} \\ \end{aligned}$$

where

  • \({\text{EI}}\left( {\hat{\theta }} \right)\) denotes the Expected Improvement at the novel candidate input \(\hat{\theta }\). It represents the expected increase in the function value compared to the best candidate evaluated so far.

  • \(Z\) is the standardised improvement, \({\Phi }\left( Z \right)\) is the standard normal cumulative distribution function, and \(\phi \left( Z \right)\) is the standard normal probability density function.

  • \(\theta^{ + }\) represents the best candidate evaluated so far.
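As a numerical illustration, the closed-form EI above can be evaluated as follows (maximisation convention, matching the definition of \(Z\)); the posterior means and standard deviations below are hypothetical values standing in for \(\tilde{\mu }_{f}\) and \({\tilde{\Sigma }}_{f}\) at three candidate pruning configurations.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu: np.ndarray, sigma: np.ndarray, f_best: float) -> np.ndarray:
    """Closed-form EI given the GP posterior mean `mu` and standard
    deviation `sigma` at candidate points, and the best value so far."""
    sigma = np.maximum(sigma, 1e-12)  # guard against division by zero
    Z = (mu - f_best) / sigma
    return sigma * (Z * norm.cdf(Z) + norm.pdf(Z))

mu = np.array([0.92, 0.95, 0.90])     # posterior mean (e.g. accuracy) per candidate
sigma = np.array([0.02, 0.05, 0.01])  # posterior uncertainty per candidate
ei = expected_improvement(mu, sigma, f_best=0.93)
print(ei, "-> next candidate:", ei.argmax())  # evaluate the highest-EI candidate next
```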

We prune the graph convolutional network by trimming weights according to their magnitude, expressed as a percentage ranging from 0 to 100%. To do this, we rank the weights of each of the three layers independently and set the bottom percentage of weights to zero, which effectively severs the connections between those neurons. The expected-improvement criterion takes into account the mean and variance of the posterior belief and assigns a higher score to candidate configurations that are expected to yield significant improvements.

3.4 Reducing and modifying weights using quantisation

Quantisation is the process of reducing the number of levels or values that a signal or data can take on, which reduces the memory and computational requirements of a model by lowering the precision of its parameters. In the process of quantisation, the GCN's parameters are grouped into a finite number of intervals or bins, and each parameter is replaced by the value corresponding to the bin it falls in. This limits the parameter's feasible values, decreasing the memory and computational demands of the model.
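A minimal sketch of this binning step is given below: weights are mapped to the centres of a fixed number of uniform bins, so each parameter can take only `n_bins` distinct values. The bin count and data are illustrative.

```python
import numpy as np

def quantise_weights(weights: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Uniform quantisation: replace each weight with the centre of
    the bin it falls in, leaving only `n_bins` representable values."""
    edges = np.linspace(weights.min(), weights.max(), n_bins + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    idx = np.clip(np.digitize(weights, edges) - 1, 0, n_bins - 1)
    return centres[idx]

w = 0.1 * np.random.randn(1000)
wq = quantise_weights(w)
print(f"distinct values: {np.unique(wq).size}, "
      f"max quantisation error: {np.abs(w - wq).max():.4f}")
```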

The tuning process starts by initialising the pruned parameters of the baseline model to zero. The first round of training is used to determine which connections between neurons should be severed before further training begins. Rather than pruning and fine-tuning each layer of the network separately, we zero and then fine-tune each convolutional layer in sequence.

4 Results and discussions

4.1 Datasets description

For this study, we used the publicly available WESAD and SWELL-KW datasets, which offer a wealth of physiological and motion data collected from wearable sensors, providing valuable information for the study of stress and affective states in individuals.

The WESAD (Wearable Stress and Affect Detection) (Schmidt et al., 2018) dataset is a publicly available dataset that contains physiological and motion data collected from wearable sensors worn by participants while they engage in a variety of activities, including baseline measurements, stress-induction tasks, and affective computing tasks. The dataset includes data from 15 participants, including 7 females and 8 males, who were between the ages of 22 and 35. The data was collected using a variety of wearable sensors, including a chest-strap heart rate monitor, a wrist-worn accelerometer, and a wrist-worn electrodermal activity sensor. The dataset includes both raw sensor data and preprocessed data, as well as labels for stress and affective states. The size of the WESAD dataset is about 3.8 GB, with approximately 10 h of data collected from each participant.

The SWELL-KW (Koldijk et al., 2014) dataset is a multi-modal dataset obtained from an experiment in which 25 people performed knowledge-work tasks under various stressors (email interruptions and time pressure). The dataset contains computer logging, facial expression, body posture, heart rate variability, skin conductance, and validated questionnaire responses from the subjects. The size of the SWELL dataset is about 5.4 GB, with approximately 3 h of data collected from each participant.

4.2 Data preprocessing

Preprocessing is necessary before conducting a US-HRV study in order to eliminate outliers in the RR intervals caused by noise, such as movement. Data more than three standard deviations (SD) from the mean are considered outliers in RR-interval data. A non-linear interpolation method, cubic spline interpolation, was chosen to process the HRV signal. The sensor data was then further analysed using a sliding window with a shift of 0.25 s. To calculate ECG features, a five-second window was used, which is standard practice in acceleration-based context recognition. All physiological characteristics were calculated over a 60-s frame, except for the statistical and frequency-domain ECG features. The size of this window was chosen based on the suggestions of Kreibig (2010). We performed fivefold cross-validation on the data with a train-test split ratio of 80:20.
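A minimal sketch of this outlier-removal and interpolation step is shown below (using SciPy's `CubicSpline`); the three-SD rule and the beat-time grid follow the description above, while the data are synthetic.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def clean_rr_series(t: np.ndarray, rr: np.ndarray, sd_limit: float = 3.0) -> np.ndarray:
    """Drop RR intervals more than `sd_limit` SDs from the mean and
    fill the gaps by cubic-spline interpolation."""
    keep = np.abs(rr - rr.mean()) <= sd_limit * rr.std()  # inlier mask
    spline = CubicSpline(t[keep], rr[keep])               # fit on clean beats only
    return spline(t)                                      # re-evaluate on the full grid

t = np.arange(0, 60, 0.8)                 # beat times over a 60-s frame (synthetic)
rr = 800 + 30 * np.random.randn(t.size)   # RR intervals in ms
rr[10] = 2000                             # inject a movement artefact
rr_clean = clean_rr_series(t, rr)
```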

The raw ECG data was processed by first applying a high-pass filter to eliminate the DC component. The filtered signal was then divided into periods of 5 s, for which statistical features and the peak frequency were calculated. Power spectral density was determined using seven frequency bands evenly spaced from 0 to 350 Hz. After this second round of processing, the raw ECG data was further processed by applying a low-pass filter. The processed signal was then divided into periods of 60 s, on which various peak characteristics, such as the overall number of signal peaks and their mean amplitude, were calculated. Peak detection was performed on the components that contributed significantly to the ECG. These peaks were used to calculate the average heart rate (HR) and HR variability.

The following seven US-HRV measures reported in Table 4 were analysed for this study.

Table 4 US-HRV measures and values

A 60-s measurement revealed significant variations across groups in the Mean RR and LF characteristics. When comparing long-term changes, the high-stress group did not show smaller variances in either the Mean_RR or the LF measures. Analysis of 2-min HRV samples showed a significant decrease in Mean_RR, a significant increase in LF, and a stable LF/HF ratio. Non-linear assessments of heart rate variability showed a reduction during acute mental stress.

4.3 Classification analysis

Table 5 presents the results of various classifier algorithms applied to the stress condition classification task. The classifiers include K-Nearest Neighbour (Wang et al., 2013), Multi-Layer Perceptron (Zainudin et al., 2021), Support Vector Machine (Zubair & Yoon, 2020), Convolutional Neural Network (Moridani et al., 2020), Deep Neural Network (Zainudin et al., 2021), Artificial Neural Network (Uddin et al., 2022), and Linear Discriminant Analysis (Ham et al., 2017). The metrics used to evaluate the performance of the classifiers are recall, precision, accuracy, and F1-score.

Table 5 Classification before applying the pruning methods

Based on the results shown in Table 5, for the WESAD dataset the proposed method has a Recall of 97.2%, higher than the other classifiers, indicating that it correctly identifies a high percentage of the relevant samples. Its Precision of 98.42% is also the highest, indicating that a high percentage of the samples it flags as relevant are actually relevant. This results in an Accuracy of 98.84%, again the highest among the classifiers. Finally, its F1-Score of 96.48% is also the highest, indicating a good balance of precision and recall.

The proposed method has performed well on the SWELL dataset, with high scores across various performance metrics. The Recall score of 95.32% indicates that the proposed model correctly identifies a significant proportion of relevant samples, while the Precision score of 94.22% shows that the proposed model accurately identifies relevant samples from those it flags. The model also achieves a high Accuracy score of 95.74%, indicating it can classify most samples with high accuracy. The high F1-Score of 96.37% is also noteworthy, suggesting the method has achieved a good balance between precision and recall.

4.4 Feature identification and explainability using SHAP

The probabilistic values associated with each contributing factor in the WESAD and SWELL datasets are identified using SHAP, and unimportant weights are removed. In this scenario, we loop through the named modules in the model and check whether each module is a convolution layer. If it is, we extract its weights, create a binary mask indicating which weights are non-zero, and multiply the weights by the mask to remove the unimportant ones. We then create an instance of the GCN model with input features of size 10, hidden features of size 5, and output features of size 2, and use this function to prune the weights in the model, as sketched below.
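A sketch of this masking loop is given below. The GCN is represented by linear transforms (as in the layer sketch of Sect. 3.2), and the mask shown simply keeps already non-zero weights as a stand-in; in the actual pipeline the mask would be derived from the SHAP-based importance ranking.

```python
import torch
import torch.nn as nn

def prune_model_weights(model: nn.Module) -> None:
    """Loop through the named modules; for each weight-bearing layer,
    build a binary mask and multiply it in to remove unimportant weights."""
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            w = module.weight.data
            mask = (w != 0).float()   # stand-in for the SHAP-derived mask
            module.weight.data = w * mask

# GCN stand-in with input size 10, hidden size 5, output size 2
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
prune_model_weights(model)
```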

As we can see from Fig. 3, the major contributors to the prediction are the Median RR and Mean RR values. We therefore evaluate the model taking the Mean RR value as a reference point, since its contribution lies strictly within the desired range (0.2–0.3 for WESAD and 0.09–0.3 for SWELL), and with reference to it we compute the LF/HF ratio. The findings of this study, using Mean RR as a reference point, show that LF levels and the LF/HF ratio significantly increase during stressful situations. This agrees with the existing knowledge that increased activity of the autonomic nervous system is strongly linked to changes in HRV values recorded during stress (Evans et al., 2013; Pham et al., 2021), as evidenced here by the marked increase in LF and the significant decrease in Mean RR. Non-linear HRV measures, such as sample entropy or fractal dimension, are commonly used to quantify HRV. A reduction in these non-linear measures can indicate increased stability and regularity in heart rate variability patterns, a phenomenon often observed under stress, when the body shifts towards more stable and periodic HRV behaviour. This shift is related to the deactivation of control loops in the cardiovascular system that regulate the heart rate. Thus, stress can have a significant impact on the autonomic regulation of the heart, as evidenced by changes in HRV measures.

Fig. 3
figure 3

Analysis of feature contribution via Shapley values: (A), (B) on WESAD, (C) on SWELL

4.5 Pruning and quantisation analysis

In general, pruning zeroes a certain fraction of the weights, ranging from 0 to 1. This is achieved by setting the lowest-ranking weights in each of the three layers to zero, effectively cutting off communication between those neurons. We present an improved version of traditional pruning methods: as a first step, we use a threshold of 0.5 to prune weights and then retrain the network to recover the lost accuracy. We then iteratively prune and retrain the network until a sparsity of 60% is reached, as sketched below.
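Schematically, the iterative prune-and-retrain loop can be written as follows; `fine_tune` stands for the usual supervised training loop and is assumed to preserve the zero mask (i.e. pruned weights stay frozen at zero), and the per-round pruning fraction is illustrative.

```python
import torch
import torch.nn as nn

def sparsity(model: nn.Module) -> float:
    """Fraction of parameters that are exactly zero."""
    total = sum(p.numel() for p in model.parameters())
    zeros = sum((p == 0).sum().item() for p in model.parameters())
    return zeros / total

def iterative_prune(model: nn.Module, fine_tune, step: float = 0.5, target: float = 0.60):
    """Zero the lowest-magnitude fraction of each layer's weights, fine-tune
    to recover accuracy, and repeat until the target sparsity is reached."""
    frac = step
    while sparsity(model) < target:
        for module in model.modules():
            if isinstance(module, nn.Linear):
                w = module.weight.data
                thr = w.abs().quantile(frac)                 # per-layer threshold
                module.weight.data = w * (w.abs() > thr).float()
        fine_tune(model)              # recover the accuracy lost in this round
        frac = min(frac + 0.1, 0.95)  # prune a larger fraction next round
```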

To optimise the model's performance, we combine quantisation with sequential fine-tuning, zeroing and fine-tuning each convolutional layer in turn instead of pruning and fine-tuning each layer individually. The process starts by ranking the weights in the first layer and setting (and freezing) the required percentage of them to zero. We then move on to the next convolutional layer and repeat the zeroing of the necessary proportion of its parameters followed by fine-tuning, continuing until all convolutional layers have been processed.

We evaluate the performance of the model by measuring the overall accuracy, F1 score, loss, and sensitivity for different levels of sparsity for Convolutional Neural Networks with pruning (Moridani et al., 2020), Deep Neural Networks with pruning (Zainudin et al., 2021), Artificial Neural Networks with pruning (Uddin et al., 2022) and our proposed GCN with pruning and quantisation. The results are presented in Table 6. The proposed GCN with pruning and quantisation performed the best, with a recall of 96.2% and an accuracy of 97.75% on the WESAD dataset, while having a recall of 92.15% and an accuracy of 94.48% on the SWELL dataset.

Table 6 Classification values after applying pruning and quantisation

The efficacy of our model can be attributed to a multitude of sophisticated technical implementations. Firstly, the model has the capability to independently determine the most suitable network architecture, specifically the optimal number of preserved channels at each layer, in order to enhance efficiency and achieve task-specific performance. Additionally, the model incorporates advanced pruning techniques to remove redundant connections at the filter level, resulting in decreased computational complexity and enhanced efficiency and speed. Furthermore, the methodology incorporates tensor factorization techniques to decompose the weights of the network into smaller and more manageable components. Additionally, it applies quantization methods to conserve computational resources while minimising the loss of precision. The proposed model also introduces a modified convolutional structure that utilises group-wise convolution to enhance the processing of high-dimensional data, leading to improved efficiency and accuracy in the outcomes.

The overall accuracy of the model is affected by the sparsity level, which is the percentage of weights that are set to zero. The results show that pruning with quantisation is the most effective method for maintaining high accuracy up to a sparsity level of around 60% to 70%. The fine-tuning stage helps to improve sensitivity and maintain high accuracy even at higher sparsity levels.

4.6 Results analysis and discussion

The evolution of wearable technology in recent years has heralded a transformative era in health monitoring, with stress detection—a pervasive health parameter affecting numerous individuals globally—at the forefront of this innovation. This study's primary objective was to effectively leverage physiological signals derived from these wearable devices to accurately identify and quantify stress levels. In a novel approach, the study developed a machine learning model utilising the dynamic capabilities of GCN, concurrently integrating pruning and quantisation methodologies to elevate computational efficiency—a critical element when applied to resource-limited wearable devices.

Our study harnessed data from the well-regarded WESAD and SWELL datasets, renowned as comprehensive repositories in stress detection research, hosting a diverse spectrum of physiological signals. These rich datasets enabled the cultivation of a holistic understanding of varied bodily responses elicited during stress episodes. The subsequent training and evaluation of our model on these datasets generated inspiring results.

The GCN model's performance was not just encouraging but strikingly effective, achieving accuracy rates of 97.75% on the WESAD dataset and 94.48% on the SWELL dataset. This potent performance, corroborated by an accuracy range of approximately 95% to 98%, underscores the model's robust predictive capabilities.

Our model's performance metrics extended beyond precision and accuracy, evidencing robust levels of recall, too. Precision, measuring the exactitude of positive predictions, was recorded as 94.42% on the WESAD dataset and 93.45% on the SWELL dataset. Notably, recall or sensitivity, gauging the model's capacity to identify true positives accurately, exhibited impressive results of 96.2% and 92.15% on the WESAD and SWELL datasets, respectively. These results testify to the model's proficiency in accurately detecting stress instances while minimising false negatives and positives effectively.

Further performance evaluation involved a meticulous examination of the Receiver Operating Characteristic (ROC) and Precision-Recall curves. The Area Under the Receiver Operating Characteristic curve (AUC-ROC) provided a robust performance metric, considering sensitivity and specificity. Prior to the application of pruning and quantisation, the model achieved AUC-ROC scores of 0.996 on the WESAD dataset and 0.992 on the SWELL dataset. Post-application, the scores were marginally affected, registering at 0.994 and 0.986, thereby suggesting a minimal impact on performance. Similarly, the area under the precision-recall curve (AUC-PR), particularly valuable in the context of imbalanced datasets, demonstrated comparable trends.

Positioned alongside previous studies, our GCN model exhibited a commendable, potentially superior performance, excelling in terms of accuracy, precision, recall, and F1-score. However, the distinguishing merit of our model lies in its enhanced computational efficiency, achieved via the integrated pruning and quantisation techniques. This approach successfully reduced the model size by a substantial average of 60%, simultaneously improving processing time by 45%. Consequently, this innovative methodology provides a more feasible and efficient solution for implementation in wearable devices, heralding a new paradigm in stress detection technology.

4.7 Inferences

In addition to accuracy and sensitivity, power consumption is an important consideration when implementing machine learning and deep learning models in real-world applications. The complexity of the model, measured in floating-point operations (FLOPs), directly affects power consumption. The results show that as the sparsity level increases, the model's complexity decreases, yielding significant power savings. For example, at a sparsity level of 63.4%, the base model's complexity drops from 1.46 million to 0.57 million FLOPs on the WESAD dataset and from 1.56 million to 0.67 million FLOPs on the SWELL dataset, a 60% to 70% decrease in complexity with a corresponding reduction in power consumption. At this sparsity level, the pruning approach attains an F1 score of 97.66% and an accuracy of 97.75% on WESAD, and an F1 score of 94.39% and an accuracy of 94.48% on SWELL. Overall, these results show that by carefully adjusting the sparsity level, it is possible to achieve a good trade-off between accuracy, sensitivity, and power consumption.

5 Conclusion

This study presents a novel iterative pruning with a quantisation approach for identifying mental stress in real-time using Ultra-short Heart Rate Variability measurements. The proposed approach uses a graph convolutional network model to classify US-HRV measurements with a high degree of accuracy and efficiency. As the GCN model's complexity can be a challenge for real-time applications, especially when deployed on resource-constrained devices, this study proposed a multi-stage pruning technique for GCN models that reduces their complexity while maintaining virtually all of their performance. The results show that the proposed method can classify US-HRV with a high degree of accuracy and efficiency, and the runtime complexity is decreased by ~ 60% compared to the initial model.

Notwithstanding the encouraging outcomes, it is imperative to recognise certain constraints within our research. The study utilised a limited number of datasets for both training and evaluation of the models. These datasets may not comprehensively capture the full range of physiological reactions to stress, which can be influenced by a multitude of individual and contextual factors. In addition, the extensive range of wearable devices, each possessing distinct specifications and measurement intricacies, has the potential to affect the model's generalisability. Subsequent investigations should consider including a more extensive array of datasets and evaluating the model's performance on various categories of wearable devices in order to augment its robustness and versatility.

Future research in the area of stress detection using ultra-short HRV and wearable sensors could involve training machine learning and deep learning algorithms on bigger and more diversified data sets to increase their generalizability, ultimately leading to more accurate and effective stress detection tools for individuals to manage their mental health and overall well-being.