1 Introduction

Fifteen years ago, the Body Area Network (BAN) was proposed as a key infrastructure element for patient-centered medical applications. At that time, it was expected that by the year 2010 technology would let people carry a personal BAN providing medical, lifestyle, assisted-living, sports, or entertainment functions [18], and this goal remains important. Since then, sensor technologies, embedded systems, wireless communication, miniaturization, and AI have achieved great success and have brought many smart elements into everyday life. Many sensors and devices have been invented in recent years, and wearable devices have become popular [24]. Healthcare devices play an important role in health monitoring in modern societies [5, 11, 32], especially for the aging population. These devices make continuous monitoring of inhabitants possible even without hospitalization. Moreover, many technologies are applied in wearable devices [30]. With the help of wearable sensors, all physiological signals as well as physical activities of the patient can be monitored [1, 10, 31]. Sensors are relatively easy to produce, whereas the key technologies remain relatively difficult to develop. In this paper, deep learning is used to detect obstructive sleep apnea (OSA) from the electrocardiogram (ECG) signal. We believe that wearable technologies will shape future medical technology by affecting our health and fitness decisions, redefining the doctor-patient relationship, and greatly reducing healthcare costs. Figure 1 shows a potential application scenario in which family members monitor the patient over a wireless connection. Many sensors can be used in such a scenario; in this paper, we focus on the analysis and recognition of the ECG signal.

Fig. 1 The typical application of smart surveillance and monitoring

In medical science, reduced breathing during sleep is called hypopnea, whereas a complete pause in breathing is called apnea. A condition in which a person experiences either difficulty breathing or complete pauses in breathing during sleep, varying in duration and frequency, is called OSA. These two forms of sleep disorder have various causes. One is pharyngeal collapse during sleep, which leads to choking, intense snoring, sudden and frequent awakening, and disrupted sleep.

Recent studies suggest that 4% of men and 2% of women older than 50 years suffer from symptomatic OSA [11, 29]. Additionally, 2% to 4% of middle-aged adults and 1% to 3% of children suffer from OSA [2]. Despite how frequent it is, most cases go undetected, and the condition is credited with a loss of 70 billion dollars, 11.1 billion in damages, and 980 deaths each year [2]. One traditional way to detect sleep disorders is polysomnography (PSG) at a sleep lab. It records airflow, respiratory movement, oxygen saturation, body position, electromyography (EMG), electroencephalography (EEG), and electrocardiogram (ECG) [9] for detection and treatment. However, this technique is expensive, the equipment is not widely available, and testing is inconvenient, as a technician needs to supervise the recording overnight.

Over the past few years, several methods have been suggested for the detection of sleep apnea. Using the mean absolute amplitude (MAA), [22] found that thoracic and abdominal signals are good indicators for detecting sleep apnea. The method achieves 80% accuracy and 74% sensitivity on the designated dataset. OSA detection using speech signals was investigated by [3]. The designed model used prospective patients’ speech recordings to diagnose OSA automatically, without relying on phoneme recognition and segmentation. The classification scheme used the non-silence segments of the patient’s speech signal, and thus better fits the hidden information in the speech signal and reveals the dynamics of the vocal tract.

Related to our work, various traditional neural network methods [3, 20, 25, 37] for obstructive sleep apnea detection have been widely studied. For example, [3] proposed OSA detection using neural network (NN) classification of a time-frequency representation of the heart rate variability. The method used texture features extracted from normalized gray-level co-occurrence matrices of the image obtained by the short-time discrete Fourier transform. The extracted features are used as input to a three-layer multilayer perceptron for detection. [20] proposed NN-based feature selection and identification of OSA and achieved up to 70% classification accuracy. In this OSA detection scheme, the NN was used for two purposes: to choose the optimal frequency bands for identification during feature extraction, and to perform detection during the feature matching stage. Also, [2] applied a support vector machine to ECG signals to detect OSA. The method used both OSA and non-OSA subjects for training and testing the model and obtained up to 96.5% classification accuracy. All the aforementioned NN-based methods consider only shallow networks, especially for the classification of OSA from ECG recordings. However, deep networks [8, 14, 15, 19] have been shown to provide better classification results than shallow networks.

In this paper, taking advantage of the strength of CNNs in image recognition and classification, we introduce an efficient framework for OSA detection based on a convolutional neural network applied to sleep Apnea-ECG recordings. First, we extract RR-interval features from the Apnea-ECG recordings, and the extracted RR intervals are then used as input to the designed CNN model. The designed model has three convolution layers, of which the first two are each followed by batch normalization and max-pooling layers. The third convolution layer is followed by three fully connected layers, where the last fully connected layer is connected to a softmax classifier for the final decision. Details of our method are explained in Section 3. Figure 2 shows the details of our model architecture.

Fig. 2 General topology of our model: the first part is the sleep apnea ECG signal, followed by the extracted features, then the convolutional layers, and finally the fully connected layers

The rest of this paper is organized as follows. Section 2 gives a brief description of neural networks, with CNNs as the primary concept. The details of the proposed model are explained in Section 3. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

2 Neural network

Neural networks are a computational model [4, 6, 7, 36] employed in computer science and other research areas. They are built on a large group of simple neural elements (artificial neurons), loosely similar to the observed behavior of a biological brain’s axons. Each neural element is connected to several others, and these connections can enhance the activation state of adjoining neural elements. The objective of a neural network is to solve problems in a way similar to the human brain. For inputs x1, x2,..., xN, the output of each individual neural unit is computed as,

$$ Output =f\left( \sum\limits_{i = 1}^{N}w_{i} x_{i}+b\right) $$
(1)

where wi and b denote the weights and the bias, respectively, which can initially be random numbers and are later learned by the model itself, f represents an activation function such as the sigmoid or ReLU function, and N denotes the number of inputs to the unit. A threshold (bias) may exist on each connection and on the element itself, such that the signal must exceed the limit before propagating to other neurons. These systems are self-learning and trained, rather than explicitly programmed, and excel in areas where the feature detection problem is hard to express in a conventional computer program.
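As a minimal sketch of (1), the following Python snippet computes the output of a single neural unit. The function names (`sigmoid`, `relu`, `neuron_output`) and the example values are illustrative assumptions, not part of the original model code.

```python
import math


def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))


def relu(a):
    """Rectified linear unit: max(0, a)."""
    return max(0.0, a)


def neuron_output(x, w, b, f):
    """Eq. (1): apply activation f to the weighted sum of inputs plus bias."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical inputs, weights and bias chosen only for illustration.
print(neuron_output([1.0, 2.0], [1.0, 1.0], -1.0, relu))
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid))
```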

2.1 Convolutional neural networks

A feed-forward neural network can be understood as a composition of several functions

$$ f(x)=f_{k} (\ldots f_{2} (f_{1} (x;w_{1}); w_{2})\ldots;w_{k}) $$
(2)

Each function fk takes xk as input (xk can be an image or a sound) together with a parameter wk and produces an output xk+1. Although the type and sequence of the functions are usually handcrafted, the parameters w = (w1,..., wk) are learned from data to solve the target problem. Initially, w can be drawn from a normal distribution with mean zero, and the optimum values are then learned by the model. The function f is a non-linear transformation applied to the input data x that is local and translation invariant.
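The composition in (2) can be sketched in a few lines of Python. The helper `feed_forward` and the two toy layer functions `scale` and `shift` are hypothetical stand-ins for real layers and are used here only to show how each output becomes the next input.

```python
def feed_forward(x, layers):
    """Apply each (function, parameter) pair in order, as in Eq. (2)."""
    for f, w in layers:
        x = f(x, w)  # the output of one function is the input of the next
    return x


def scale(x, w):
    """Toy layer: multiply every element by the parameter w."""
    return [w * v for v in x]


def shift(x, w):
    """Toy layer: add the parameter w to every element."""
    return [v + w for v in x]


# Computes shift(scale(x; 2.0); 1.0) for a hypothetical input.
print(feed_forward([1.0, 2.0], [(scale, 2.0), (shift, 1.0)]))
```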

2.2 Back propagation in CNNs

The parameters of a CNN, w = (w1,..., wk), should be learned in such a way that the overall CNN function z = f(x;w) attains the desired objective.

In simple terms, given N input-output pairs (x1, z1),...,(xN, zN), where xi is an input and zi is the corresponding output, and a loss \(l(z,\hat z)\) that expresses the penalty for estimating \(\hat z\) instead of z, the goal is to minimize the penalty function,

$$ L(w)=\frac{1}{N} \sum\limits_{i = 1}^{N}l(z_{i},f(x_{i};w)) $$
(3)

This can be minimized by gradient descent: compute the gradient of the objective L at the current solution \(w^{t}\) and then update the solution along the direction of steepest descent of L as,

$$ w^{t + 1}=w^{t}-\eta_{t} \frac{\partial L}{\partial w} (w^{t} ) $$
(4)

where \(\eta_{t}\in\mathbb{R}_{+}\) is the learning rate.

Similarly, given an initialized bias corresponding to the weight, the bias is updated as,

$$ b^{t + 1}=b^{t}-\eta_{t} \frac{\partial L}{\partial b} (b^{t} ) $$
(5)

where \(\eta_{t}\in\mathbb{R}_{+}\) is again the learning rate.
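To make the update rules (4) and (5) concrete, here is a minimal sketch that fits a one-dimensional linear model by gradient descent on the mean squared loss. The toy model z = w x + b, the data, the fixed learning rate and the step count are all assumptions chosen for illustration; they are unrelated to the CNN trained in this paper.

```python
def fit_line(xs, zs, eta=0.1, steps=2000):
    """Gradient descent on L(w, b) = (1/N) * sum((w*x + b - z)^2)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - z for x, z in zip(xs, zs)]
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        w -= eta * grad_w  # weight update, Eq. (4)
        b -= eta * grad_b  # bias update, Eq. (5)
    return w, b


# Hypothetical data generated from z = 2x + 1.
w, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)
```

With this data the iterates approach w = 2 and b = 1; a learning rate that is too large would make the same loop diverge, which is why the rate is usually decayed during training.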

2.3 ECG data in sleep apnea

Heart rate and other features of the ECG vary in characteristic ways in conjunction with sleep-related breathing disorders [17]. Earlier work made use of cyclical variations of heart rate but did not find effective algorithms to quantify sleep-related breathing disorders based on heart rate alone [17, 27]. It is clear that an estimate of heart rate cannot produce an apnea index or a hypopnea index, since both values are obtained by evaluating airflow, respiratory effort, and oxygen saturation. Despite this limitation, it has been shown that evaluation of the ECG can provide an approximation of disturbed breathing during the night that should correspond to the results obtained by standardized apnea scoring [28]. The sleep apnea ECG database was collected to detect sleep-related breathing disorders based on a single-channel ECG recording. According to this database, all polysomnographic recordings were scored by one expert in a diverse way. Figure 3 shows an example of an ECG signal with sleep apnea.

Fig. 3 ECG signal of sleep apnea

3 Proposed model

Our proposed CNN-based OSA detection model has several basic components. In this section, we explain each component in detail, together with the overall topology of the model.

3.1 Proposed OSA detection model

The general topology of our OSA detection model is illustrated in Fig. 2. CNNs have been widely employed in various applications of pattern recognition [12, 33, 34] and detection [3]. The number of layers and nodes in the network varies with the size and structure of the input data. Our proposed model is based on feed-forward CNNs, which learn from a predefined set of input-output example pairs. As shown in Fig. 5, our end-to-end model has the following basic sections: feature extraction, convolution layer, pooling layer, batch normalization layer, ReLU layer, fully connected layer, and softmax layer, which together provide better detection results.

Given the extracted ECG signal features, the first convolution layer applies 64 filters of size 3x3x1 and outputs 64 feature maps (with stride 1, which is the same for all convolution layers). This layer is followed by batch normalization and then by 2x2 max pooling. The second convolutional layer takes the output of the first convolutional layer as input and filters it with 64 kernels of size 3x3x64, followed by batch normalization and then a 2x2 max pooling layer. The third convolutional layer also has 64 kernels of size 3x3x64, connected to the (normalized and pooled) outputs of the second convolutional layer. Finally, the last convolution layer is followed by three fully connected (FC) layers with 100, 10 and 2 neurons, respectively, where the ReLU activation function is inserted between FC100 and FC10 and between FC10 and FC2. The softmax layer is applied to the final output. We give the details of these basic layers in the following subsections.
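As a worked check of the layer sizes above, the following sketch counts the learnable parameters of the three convolution layers from the stated filter shapes. It assumes one bias per filter and deliberately leaves out the batch normalization and fully connected layers, whose sizes depend on details not given here; the helper name `conv_params` is illustrative.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per filter (the bias term is an assumption)."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


# (kernel height, kernel width, input channels, number of filters)
conv_layers = [
    (3, 3, 1, 64),   # first convolution layer: 64 filters of size 3x3x1
    (3, 3, 64, 64),  # second convolution layer: 64 kernels of size 3x3x64
    (3, 3, 64, 64),  # third convolution layer: 64 kernels of size 3x3x64
]

counts = [conv_params(*layer) for layer in conv_layers]
print(counts, sum(counts))
```

Under these assumptions the first layer holds 640 parameters and each of the other two holds 36,928, for 74,496 convolutional parameters in total.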

3.1.1 Feature extraction

It is common, for example with speech signals, to extract a set of features and to carry out detection on these features instead of on the original signals themselves. In our feature extraction step, the apnea ECG recordings used for training are passed through a feature extraction process based on RR intervals (the time interval from one R wave to the next, as shown in Fig. 4). More specifically, an RR interval is the time interval between two successive R peaks [2], which can be written as,

$$ RR(i)=R(i + 1)-R(i), i = 1,2,...,n-1 $$
(6)

The extracted features are then used as input to the remaining layers of the model.
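Equation (6) amounts to taking differences between consecutive R-peak times. A minimal sketch follows, assuming the R-peak times have already been detected and are given in seconds; peak detection itself is outside this example, and the function name and sample values are hypothetical.

```python
def rr_intervals(r_peak_times):
    """Eq. (6): RR(i) = R(i + 1) - R(i) for i = 1, ..., n - 1."""
    return [later - earlier
            for earlier, later in zip(r_peak_times, r_peak_times[1:])]


# Hypothetical R-peak times in seconds.
print(rr_intervals([0.5, 1.25, 2.0, 3.0]))
```

A sequence of n R peaks yields n - 1 intervals, so a recording with fewer than two detected peaks produces an empty feature list.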

Fig. 4 RR interval of an Apnea-ECG recording

3.1.2 Convolution layer

Given the extracted features xi from the input data, each convolution produces an output y computed as,

$$ y=\sum\limits_{i = 1}^{N}w_{i}x_{i}+b $$
(7)

where N is the number of input samples, and wi and b denote the weights and bias of the current layer, respectively. The details of parameter initialization and optimization are explained in Section 3.2. Figure 5 shows the architecture details of our OSA detection model.

Fig. 5 The model’s architecture detail

3.1.3 Batch normalization

Assume that x = {x1,..., xd} is the input to a layer with dimension d. Each dimension of x is normalized by

$$ \hat x_{k}=\frac{(x_{k}-E(x_{k}))}{\sqrt{(var[x_{k}])}} $$
(8)

where E(xk) is the expectation of xk and var[xk] is the variance of xk, both computed over the training data. This type of normalization speeds up convergence [28] even when the features are not decorrelated.

Normalizing each input of a layer may change what the layer can represent. To address this problem, the transformation inserted in the network is designed so that it can represent the identity transform. For this purpose, a pair of parameters γk and βk is introduced for each xk to scale and shift the normalized value as,

$$ y_{k}=\gamma_{k} \hat x_{k}+\beta_{k} $$
(9)

γk and βk are learned along with the other model parameters. With this formulation, the convolutional neural network benefits from data normalization, which helps the network train faster and reach higher accuracy.
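Equations (8) and (9) can be sketched for a single dimension xk as follows. The function normalizes a list of values of that dimension using their own mean and variance; the small constant `eps` is an added assumption for numerical safety and does not appear in (8), and the names and data are illustrative.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one dimension (Eq. 8), then scale and shift it (Eq. 9)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


data = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations of one dimension
normalized = batch_norm(data, eps=0.0)
print(normalized)

# Choosing gamma = sqrt(var) and beta = mean undoes the normalization,
# which is how the layer can represent the identity transform.
restored = batch_norm(data, gamma=math.sqrt(1.25), beta=2.5, eps=0.0)
print(restored)
```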

3.1.4 Pooling layer

It is common to periodically insert a pooling layer between consecutive convolutional layers in a convolutional network architecture. Its function is to gradually shrink the spatial size of the representation in order to reduce the number of parameters and the amount of computation in the network, and hence also to control overfitting. The pooling layer operates independently on every depth slice of the input and resizes it spatially. There are various pooling strategies; in our case we employ max pooling with size 2x2.
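A minimal sketch of 2x2 max pooling on one depth slice is shown below. It assumes a stride of 2 (non-overlapping windows), which the text does not state explicitly, and drops a trailing row or column when a dimension is odd; the function name and input are illustrative.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 (stride is an assumption) on a 2-D list."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical 4x4 feature map; the result is 2x2.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(max_pool_2x2(feature_map))
```

Each output value summarizes four inputs, so the spatial size is halved in both directions while the number of depth slices is unchanged.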

3.1.5 Softmax layer

When the ECG signal has been propagated through to the output layer, the final outcome is compared with the desired value, and an error value is computed for every output unit using MSE or another measure. These errors are then back-propagated to each node in the intermediate layers. Once every node in the network has received an error value that reflects its relative contribution to the overall error, the weights and biases are updated using (4) and (5). For the final decision of the model, we apply softmax to the final output. Between consecutive fully connected layers, the ReLU activation function (i.e., max(0, x)) is applied, and the last FC layer is followed by a softmax classifier. The parameters of the model are estimated by the stochastic gradient descent algorithm, with the gradient computed by the backpropagation algorithm, to maximize the log-likelihood.
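The softmax step applied to the two outputs of the last FC layer can be sketched as follows. Subtracting the maximum score before exponentiation is a common numerical-stability choice and an assumption here, as are the function name and the example scores.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of the two-neuron FC layer.
probs = softmax([2.0, 1.0])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is the index with the largest probability; which index stands for OSA and which for normal depends on how the labels are encoded.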

3.2 Training

To obtain the optimal parameters of the model, various cost functions (error minimization methods), such as the mean square error [13], are used in image recognition. In our case, we adopt the widely used mean square error as the cost function.

Given an apnea ECG signal f, our aim is to train a mapping M that produces \(\hat y=M(f)\), where \(\hat y\) is an estimate of the target signal y obtained from f. Hence, for a training dataset of N input-target pairs \(\{f_{i},y_{i} \}_{i = 1}^{N}\), the optimization objective is,

$$ min_{\lambda} \frac{1}{2N}\sum\limits_{i = 1}^{N}\parallel M(f_{i};\lambda)-y_{i}{\parallel_{F}^{2}} $$
(10)

where λ denotes the network parameters to be learned and M(fi;λ) is the estimated signal corresponding to fi. Several activation functions, such as ReLU and its variants PReLU, RReLU and ELU, have been suggested by researchers as giving relatively better performance. For our application, we employ the ReLU activation function, and the network parameters are optimized by the Adam algorithm [21].
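The objective in (10) can be evaluated for a given set of predictions as in the sketch below. Predictions and targets are represented as plain lists of numbers, for which the squared Frobenius norm reduces to a sum of squared differences; the function name and the sample values are hypothetical.

```python
def mse_objective(predictions, targets):
    """Eq. (10): sum of squared errors over all pairs, divided by 2 * count."""
    count = len(predictions)
    total = 0.0
    for predicted, target in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
    return total / (2.0 * count)


# Two hypothetical two-class outputs compared with their targets.
loss = mse_objective([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
print(loss)
```

In this example the first prediction matches its target and the second misses both components by 1, giving a squared error of 2 and an objective value of 2 / (2 * 2) = 0.5.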

4 Experimental results

In this section, we describe the dataset used for training and testing and the way the model parameters are set and learned, and finally we report and discuss the performance of our model.

4.1 Training and testing data

To train and test the model, we used data from the Apnea-ECG database, which is freely available at (https://physionet.org/physiobank/database/#ecg/). The database [28] was assembled for the PhysioNet/Computers in Cardiology Challenge 2000. It consists of 70 ECG recordings of individuals, each normally 8 hours long, of which only 35 are annotated.

For model training, we divide the 35 annotated ECG recordings into two categories. The first category is the training set of 20 ECG recordings of individuals, covering both normal and non-normal cases, and the second category is the testing set. The testing set consists of 10 ECG recordings of individuals, which also cover normal and non-normal cases and are not part of the training set.

After feature extraction is carried out for each dataset using (6), all ECG recordings are arranged into matrices of 240x240 values for training and testing. As deep learning models benefit from large training datasets, we perform data augmentation as in [38], and the augmented data are then used as input to the next model layer for the overall training process.

4.2 Training parameters

To capture sufficient spatial information from the ECG recordings, we initialize the weights and biases by the method in [16] and use the Adam algorithm [21] with α = 0.01, β1 = 0.9, β2 = 0.999, and \(\epsilon = 10^{-8}\). The batch size is set to 64. We train the model for 50 epochs, with the learning rate decayed exponentially from 0.01 to 0.0001 over the 50 epochs. We use the MatConvNet package [35] to train the proposed network. All experiments are carried out in Matlab (R2015b) on a computer with an Intel(R) Xeon(R) CPU E3-1230v3 at 3.30GHz and an NVIDIA Tesla K40c GPU, and training takes four hours.
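One plausible reading of the exponential decay described above is a geometric interpolation between the initial and final learning rates, one value per epoch. The sketch below illustrates that reading; the exact schedule used in the experiments is not specified, so the formula and the function name are assumptions.

```python
def exponential_schedule(lr_start=0.01, lr_end=0.0001, epochs=50):
    """Learning rates spaced geometrically from lr_start to lr_end."""
    ratio = lr_end / lr_start
    return [lr_start * ratio ** (epoch / (epochs - 1)) for epoch in range(epochs)]


rates = exponential_schedule()
print(rates[0], rates[24], rates[-1])
```

With these settings every epoch multiplies the learning rate by the same constant factor, so the rate falls by two orders of magnitude over the 50 epochs.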

4.3 Performance evaluation

We evaluated the effectiveness of our model on different ECG recordings. To measure detection performance, we compute three commonly used measures [22], namely sensitivity (se), specificity (sp), and accuracy (ac), which are defined as follows.

$$ se =\frac{{OSA}_{s}}{(OSA)_{t}}\times100\% $$
(11)
$$ sp =\frac{{NOR}_{s}}{(NOR)_{t}}\times100\% $$
(12)
$$ ac=\frac{{OSA}_{s}+{NOR}_{s}}{ (OSA)_{t} + (NOR)_{t}}\times100\% $$
(13)

where ’OSAs’ is the number of correctly detected OSA signals, ’NORs’ is the number of correctly detected normal (NOR) signals, (OSA)t is the total number of OSA signals, and (NOR)t is the total number of NOR signals tested [12]. Our experimental results show that the CNN-based model detects OSA effectively and provides up to 97.80% accuracy at the 50th training epoch. As shown in Table 1, the detection accuracy increases as the number of training epochs increases from 20 to 50 in increments of 10. However, the model does not provide better accuracy when trained for more than 50 epochs. Table 1 shows the performance of our model on the ECG dataset at different training epochs.
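The three measures in (11) to (13) can be computed from the four counts as in this minimal sketch. The counts in the example are made up purely to show the arithmetic and are not results from our experiments; the function and argument names are illustrative.

```python
def detection_metrics(osa_correct, osa_total, nor_correct, nor_total):
    """Return sensitivity, specificity and accuracy in percent (Eqs. 11-13)."""
    se = 100.0 * osa_correct / osa_total
    sp = 100.0 * nor_correct / nor_total
    ac = 100.0 * (osa_correct + nor_correct) / (osa_total + nor_total)
    return se, sp, ac


# Hypothetical counts: 45 of 50 OSA and 48 of 50 normal signals detected.
se, sp, ac = detection_metrics(45, 50, 48, 50)
print(se, sp, ac)
```

For these made-up counts the sensitivity is 90%, the specificity is 96%, and the accuracy is 93%.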

Table 1 The performance result of our model

5 Conclusion and future work

An effective and efficient obstructive sleep apnea detection model is proposed in this paper. The model builds on the recent success of convolutional neural networks in image and audio recognition problems. We proposed an efficient deep-learning-based architecture with ten layers. The Apnea-ECG recording datasets are used for training and testing. Our experimental results show that OSA detection based on a convolutional neural network is a more appropriate method than detection based on traditional neural networks. In addition, the accuracy, sensitivity, and specificity values of the model demonstrate its effectiveness.

In future work, we will continue to consider the OSA case and plan to investigate the target problem with recurrent neural networks, taking accuracy, sensitivity, and specificity into account.