1 Introduction

The electrocardiogram (ECG) is a widely used monitoring tool that evaluates the cardiac status of patients and provides information on cardiac electrophysiology. Given the importance of ECGs in medical diagnosis and the need to ease the workload of clinicians, developing automated analysis systems capable of detecting and identifying abnormal signals is crucial. However, a classifier trained on labeled ECGs covering specific diseases may fail to recognize abnormal conditions that were not encountered during training, given the diversity and rarity of cardiac diseases [8, 16, 23]. In contrast, anomaly detection, which trains only on normal healthy data, can flag any potential abnormality and thus avoids missing rare cardiac diseases [10, 17, 21].

Current anomaly detection techniques, including one-class discriminative approaches [2, 14], reconstruction-based approaches [15, 30], and self-supervised learning-based approaches [3, 26], all operate under the assumption that a model trained solely on normal data will struggle to process anomalous data, so a substantial drop in performance signals an anomaly. While anomaly detection has been widely used in the medical field to analyze medical images [12, 24] and time-series data [18, 29], detecting anomalies in ECG data is particularly challenging due to substantial inter-individual differences and the presence of anomalies in both global rhythm and local morphology. So far, few studies have investigated anomaly detection in ECGs [11, 29]. TSL [29] uses expert knowledge-guided amplitude- and frequency-based data transformations to simulate anomalies for different individuals. BeatGAN [11] employs a generative adversarial network to separately reconstruct normalized heartbeats instead of the entire raw ECG signal. While BeatGAN alleviates individual differences, it neglects the important global rhythm information of the ECG.

This paper proposes a novel multi-scale cross-restoration framework for ECG anomaly detection and localization. To the best of our knowledge, this is the first work to integrate both local and global characteristics for ECG anomaly detection. To account for multi-scale information, the framework adopts a two-branch autoencoder architecture, with one branch focusing on global features from the entire ECG and the other on local features from heartbeat-level details. A multi-scale cross-attention module is introduced, which learns to combine the two feature types for the final prediction. This module imitates the diagnostic process of experienced cardiologists, who carefully examine both the entire ECG and individual heartbeats to detect abnormalities in the overall rhythm and the local morphology of the signal [7]. Each branch employs a masking and restoration strategy, i.e., the model learns to perform temporally dependent signal inpainting from the adjacent unmasked regions within a specific individual. Such context-aware restoration is less susceptible to individual differences. During testing, anomalies are identified as samples or regions with high restoration errors.

To comprehensively evaluate the proposed method on a large number of individuals, we adopt the public PTB-XL database [22], which provides only patient-level diagnoses, and ask experienced cardiologists to provide signal point-level localization annotations. The resulting dataset is introduced as a large-scale, challenging benchmark for ECG anomaly detection and localization. The proposed method is evaluated on this benchmark as well as on two traditional ECG anomaly detection benchmarks [6, 13]. Experimental results show that the proposed method outperforms several state-of-the-art methods in both anomaly detection and localization, highlighting its potential for real-world clinical diagnosis.

2 Method

In this paper, we focus on unsupervised anomaly detection and localization on ECGs, training on only normal ECG data. Formally, given a set of N normal ECGs denoted as \(\{x_i,i=1,...,N\}\), where \(x_i \in \mathbb {R}^D\) represents the vectorized representation of the i-th ECG consisting of D signal points, the objective is to train a computational model capable of identifying whether a new ECG is normal or anomalous and of localizing the anomalous regions in abnormal ECGs.

Fig. 1. The multi-scale cross-restoration framework for ECG anomaly detection.

2.1 Multi-scale Cross-restoration

In Fig. 1, we present an overview of our two-branch framework for ECG anomaly detection. One branch is responsible for learning global ECG features, while the other focuses on local heartbeat details. Our framework comprises four main components: (i) masking and encoding, (ii) multi-scale cross-attention module, (iii) uncertainty-aware restoration, and (iv) trend generation module. We provide detailed explanations of each of these components in the following sections.

Masking and Encoding. Given a pair consisting of a global ECG signal \(x_{g}\in \mathbb {R}^D\) and a randomly selected local heartbeat \(x_{l}\in \mathbb {R}^d\) segmented from \(x_{g}\) for training, as shown in Fig. 1, we apply two random masks, \(M_g\) and \(M_l\), to mask \(x_{g}\) and \(x_{l}\), respectively. To enable multi-scale feature learning, \(M_l\) is applied to a consecutive small region to facilitate detail restoration, while \(M_g\) is applied to several distinct regions distributed throughout the whole sequence to facilitate global rhythm restoration. The masked signals are processed separately by global and local encoders, \(E_g\) and \(E_l\), resulting in global feature \(f^{in}_g = E_g (x_{g}\odot M_g)\) and local feature \(f^{in}_l = E_l (x_{l}\odot M_l)\), where \(\odot \) denotes the element-wise product.
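For concreteness, a minimal NumPy sketch of the two masking strategies is given below; the mask ratio and the number of masked regions are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def local_mask(d, ratio=0.3, rng=np.random):
    """Mask one consecutive region covering `ratio` of a heartbeat of length d."""
    m = np.ones(d, dtype=np.float32)
    w = int(d * ratio)
    start = rng.randint(0, d - w + 1)
    m[start:start + w] = 0.0
    return m

def global_mask(D, ratio=0.3, n_regions=4, rng=np.random):
    """Mask several chunks spread over the full ECG of length D
    (chunks may overlap in this simplified version)."""
    m = np.ones(D, dtype=np.float32)
    w = max(1, int(D * ratio / n_regions))
    for _ in range(n_regions):
        start = rng.randint(0, D - w + 1)
        m[start:start + w] = 0.0
    return m

# The encoders then see the masked signals:
#   f_g = E_g(x_g * M_g),  f_l = E_l(x_l * M_l)
```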

Multi-scale Cross-attention. To capture the relationship between global and local features, we apply the self-attention mechanism [20] to the concatenation of \(f^{in}_g\) and \(f^{in}_l\). Specifically, the attention mechanism is expressed as \(Attention(Q,K,V)=\textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V\), where Q, K, and V denote the query, key, and value, and \(\sqrt{d_k}\) is the square root of the feature dimension, used as a scaling factor. Self-attention is achieved by setting \(Q=K=V=concat(f^{in}_g,f^{in}_l)\). The resulting cross-attention feature, \(f_{ca}\), dynamically weighs the importance of each element in the combined feature. To obtain the final global and local features, \(f^{out}_g\) and \(f^{out}_l\), which contain cross-scale information, we use residual connections: \(f^{out}_g = f^{in}_g + \phi _g(f_{ca}),\ f^{out}_l = f^{in}_l + \phi _l(f_{ca})\), where \(\phi _g (\cdot )\) and \(\phi _l (\cdot )\) are MLPs with two fully connected layers.
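The fusion step can be sketched as a small PyTorch module; the token shapes, head count, and MLP widths are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiScaleCrossAttention(nn.Module):
    """Fuses global and local features via self-attention over their
    concatenation, then returns residual-updated versions of each."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.phi_g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.phi_l = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f_g, f_l):
        # f_g: (B, T_g, C) global tokens; f_l: (B, T_l, C) local tokens.
        z = torch.cat([f_g, f_l], dim=1)      # Q = K = V = concat(f_g, f_l)
        f_ca, _ = self.attn(z, z, z)          # cross-attention feature f_ca
        ca_g, ca_l = f_ca.split([f_g.size(1), f_l.size(1)], dim=1)
        # Residual connections through the two MLP heads phi_g and phi_l.
        return f_g + self.phi_g(ca_g), f_l + self.phi_l(ca_l)
```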

Uncertainty-Aware Restoration. Targeting signal restoration, the features \(f^{out}_g\) and \(f^{out}_l\) are decoded by two decoders, \(D_g\) and \(D_l\), into restored signals \(\hat{x}_g\) and \(\hat{x}_l\), together with restoration uncertainty maps \(\sigma _g\) and \(\sigma _l\) that measure the difficulty of restoring each signal point, where \(\hat{x}_g, \sigma _g = D_g(f^{out}_g),\ \hat{x}_l, \sigma _l = D_l(f^{out}_l)\). An uncertainty-aware restoration loss incorporates this uncertainty into the loss functions,

$$\begin{aligned} \mathcal {L}_{global}=\sum _{k=1}^D \{\frac{(x_{g}^{k}-\hat{x}_{g}^{k})^2}{\sigma ^{k}_{g}}+\log \sigma ^{k}_{g}\}, \ \ \mathcal {L}_{local}=\sum _{k=1}^d \{\frac{(x_{l}^{k}-\hat{x}_{l}^{k})^2}{\sigma ^{k}_l}+\log \sigma ^{k}_{l}\}, \end{aligned}$$
(1)

where for each function, the first term is normalized by the corresponding uncertainty, and the second term prevents the model from predicting a large uncertainty for all signal points, following [12]. The superscript k denotes the k-th element of the signal. It is worth noting that, unlike [12], the uncertainty-aware loss is used here for restoration rather than reconstruction.
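A minimal PyTorch sketch of Eq. (1) follows; predicting \(\log \sigma \) instead of \(\sigma \) is an implementation choice assumed here for numerical stability.

```python
import torch

def uncertainty_restoration_loss(x, x_hat, log_sigma):
    """Eq. (1): squared restoration error normalized by the predicted
    uncertainty, plus a log-uncertainty penalty that discourages the
    trivial solution of predicting large sigma everywhere.
    x, x_hat, log_sigma: (B, D) for the global branch, (B, d) for the local one."""
    return (((x - x_hat) ** 2) / log_sigma.exp() + log_sigma).sum(dim=-1).mean()
```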

Trend Generation Module. The trend generation module (TGM) illustrated in Fig. 1 generates a smooth time-series trend \(x_t \in \mathbb {R}^D\) by removing signal details; the trend is represented as the smoothed difference between adjacent time-series signal points. An autoencoder (\(E_t\) and \(D_t\)) encodes the trend into \(E_t(x_t)\), which is concatenated with the global feature \(f^{out}_g\) to restore the global ECG, \(\hat{x}_{t} = D_t(\text {concat}(E_t(x_{t}),f^{out}_g))\). The restoration loss is the Euclidean distance between \(x_{g}\) and \(\hat{x}_{t}\), \(\mathcal {L}_{trend}=\sum _{k=1}^D (x_{g}^{k}-\hat{x}_{t}^{k})^2\). This process guides global feature learning with time-series trend information, emphasizing rhythm characteristics while de-emphasizing morphological details.
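One plausible reading of the trend extraction, sketched in PyTorch: a moving average over first-order differences. The kernel width is an assumption, not a value given in the paper.

```python
import torch
import torch.nn.functional as F

def smooth_trend(x_g, kernel=25):
    """Builds x_t as a smoothed first-order difference of the signal
    ('smooth difference between adjacent signal points'); `kernel`
    controls the amount of smoothing."""
    diff = x_g[:, 1:] - x_g[:, :-1]                      # adjacent differences, (B, D-1)
    diff = F.pad(diff.unsqueeze(1), (kernel // 2 + 1, kernel // 2))
    weight = torch.ones(1, 1, kernel, device=x_g.device) / kernel
    return F.conv1d(diff, weight).squeeze(1)             # moving average -> (B, D)

# Restoration then conditions D_t on both the trend code and the global feature:
#   x_hat_t = D_t(torch.cat([E_t(x_t), f_out_g], dim=-1))
```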

Loss Function. The final loss function for optimizing our model during the training process can be written as

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{global} + \alpha \mathcal {L}_{local}+\beta \mathcal {L}_{trend}, \end{aligned}$$
(2)

where \(\alpha \) and \(\beta \) are trade-off parameters weighting the loss terms. For simplicity, we adopt \(\alpha = \beta = 1.0\) as the default.
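In code, Eq. (2) is a direct weighted sum of the three losses above (variable names are assumed):

```python
# Eq. (2): total training objective, with alpha = beta = 1.0 by default.
alpha, beta = 1.0, 1.0
loss = loss_global + alpha * loss_local + beta * loss_trend
loss.backward()
```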

2.2 Anomaly Score Measurement

For each test sample x, local ECGs from the segmented heartbeat set \(\{x_{l,m},m=1,...,M\}\) are paired with the global ECG \(x_{g}\) one at a time as inputs. The anomaly score \(\mathcal {A}(x)\) is calculated to estimate the abnormality,

$$\begin{aligned} \mathcal {A}(x)= \sum _{k=1}^D \frac{(x_{g}^{k}-\hat{x}_g^{k})^2}{\sigma ^{k}_g} + \sum _{m=1}^M \sum _{k=1}^{d} \frac{(x^{k}_{l,m}-\hat{x}_{l,m}^{k})^2}{\sigma ^{k}_{l,m}} + \sum _{k=1}^D (x_{g}^{k}-\hat{x}_{t}^{k})^2, \end{aligned}$$
(3)

where the three terms correspond to global, local, and trend restoration, respectively. For localization, an anomaly score map is generated in the same way as Eq. (3), but without summing over the signal points. Relatively large anomaly scores indicate anomalies.
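A minimal sketch of Eq. (3) for one test ECG, assuming the M segmented heartbeats are stacked into tensors:

```python
import torch

def anomaly_score(x_g, x_hat_g, sigma_g, x_l, x_hat_l, sigma_l, x_hat_t):
    """Eq. (3) for one test ECG. x_l, x_hat_l, sigma_l stack the M segmented
    heartbeats, shape (M, d); the global tensors have shape (D,). Keeping
    the per-point terms instead of summing them yields the localization map
    (with heartbeat terms placed back at their positions in the sequence)."""
    s_global = ((x_g - x_hat_g) ** 2 / sigma_g).sum()   # global restoration
    s_local = ((x_l - x_hat_l) ** 2 / sigma_l).sum()    # local restoration
    s_trend = ((x_g - x_hat_t) ** 2).sum()              # trend restoration
    return s_global + s_local + s_trend
```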

3 Experiments

Datasets. Three publicly available ECG datasets are used to evaluate the proposed method, including PTB-XL [22], MIT-BIH [13], and Keogh ECG [6].

  • PTB-XL database includes clinical 12-lead ECGs of 10 s in length for each patient, with only patient-level annotations. To build a new challenging anomaly detection and localization benchmark, 8,167 normal ECGs are used for training, while 912 normal and 1,248 abnormal ECGs are used for testing. We provide signal point-level annotations for 400 ECGs, covering 22 different abnormal types, annotated by two experienced cardiologists. To the best of our knowledge, we are the first to explore ECG anomaly detection and localization across patients on such a complex and large-scale database.

  • MIT-BIH arrhythmia dataset divides the ECGs from 44 patients into independent heartbeats based on the annotated R-peak positions, following [11]. 62,436 normal heartbeats are used for training, while 17,343 normal and 9,764 abnormal heartbeats are used for testing, with heartbeat-level annotations.

  • Keogh ECG dataset includes 7 ECGs from independent patients and is used to evaluate anomaly localization with signal point-level annotations. Each ECG contains one anomalous subsequence corresponding to a premature ventricular contraction, while the remaining sequence is used as normal data to train the model. The ECGs are partitioned into fixed-length sequences of 320 signal points by a sliding window, with a stride of 40 during training and 160 during testing (a minimal sketch of this partitioning follows the list).
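A minimal sketch of the fixed-length partitioning described above:

```python
import numpy as np

def sliding_windows(signal, length=320, stride=40):
    """Partition a 1-D ECG into fixed-length sequences, as done for the
    Keogh ECG data (stride 40 for training, 160 for testing)."""
    starts = range(0, len(signal) - length + 1, stride)
    return np.stack([signal[s:s + length] for s in starts])
```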

Evaluation Protocols. The performance of anomaly detection and localization is quantified using the area under the Receiver Operating Characteristic curve (AUC), with a higher AUC indicating better performance. To ensure comparability across annotation levels, we report patient-level, heartbeat-level, and signal point-level AUC for the respective settings. For heartbeat-level classification, the F1 score is also reported, following [11].
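With anomaly scores and binary labels in hand, the AUC at any annotation level reduces to a single scikit-learn call (a usage sketch, not the paper's evaluation code):

```python
from sklearn.metrics import roc_auc_score

# `scores` come from Eq. (3); `labels` are 0 (normal) / 1 (anomalous). The same
# call applies at the patient, heartbeat, or signal-point level, depending on
# what one entry represents.
auc = roc_auc_score(labels, scores)
```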

Table 1. Anomaly detection and anomaly localization results on the PTB-XL database. Results are shown as the patient-level AUC for anomaly detection and the signal point-level AUC for anomaly localization, respectively. The best-performing method is in bold, and the second-best is underlined.
Table 2. Anomaly detection results on the MIT-BIH dataset, compared with state-of-the-art methods. Results are shown in terms of the AUC and F1 score for heartbeat-level classification. The best-performing method is in bold, and the second-best is underlined.
Table 3. Anomaly localization results on the Keogh ECG [6] dataset, compared with several state-of-the-art methods. Results are shown as the signal point-level AUC. The best-performing method is in bold, and the second-best is underlined.

Implementation Details. The ECG is pre-processed with a Butterworth filter and a notch filter [19] to remove high-frequency noise and eliminate baseline wander. R-peaks are detected with an adaptive threshold following [5], which requires no learnable parameters. The positions of the detected R-peaks are then used to segment the ECG sequence into a set of heartbeats.
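A plausible preprocessing sketch with SciPy is shown below; the band-pass cutoffs, notch frequency, and sampling rate are assumptions for illustration, not the paper's exact values.

```python
from scipy.signal import butter, filtfilt, iirnotch

def preprocess(ecg, fs=500.0):
    """Band-pass Butterworth plus notch filtering of a 1-D ECG array.
    Cutoffs (0.5-40 Hz) and the 50 Hz notch are illustrative choices."""
    # Band-pass: removes baseline wander (low end) and HF noise (high end).
    b, a = butter(3, [0.5 / (fs / 2), 40.0 / (fs / 2)], btype="band")
    ecg = filtfilt(b, a, ecg)
    # Notch: suppress powerline interference.
    bn, an = iirnotch(50.0, Q=30.0, fs=fs)
    return filtfilt(bn, an, ecg)
```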

We use a convolutional autoencoder, following the architecture proposed in [11]. The model is trained with the AdamW optimizer, using an initial learning rate of 1e-4 and a weight decay coefficient of 1e-5, for 50 epochs on a single NVIDIA RTX 3090 GPU, with a single-cycle cosine schedule for learning rate decay. The batch size is set to 32. During testing, the model requires 2,365 MB of GPU memory and achieves an inference speed of 4.2 fps.
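The optimizer and schedule described above map directly onto standard PyTorch components (`model` is assumed to be defined elsewhere):

```python
import torch

# AdamW with lr 1e-4 and weight decay 1e-5, as stated above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
# One cosine cycle over the 50 training epochs; call scheduler.step() per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```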

3.1 Comparisons with State-of-the-Arts

We compare our method with several time-series anomaly detection methods, including the heartbeat-level detection method BeatGAN [11], the patient-level detection method TSL [29], and several signal point-level anomaly localization methods [1, 4, 9, 18, 25, 27, 28, 30]. For a fair comparison, we re-trained all methods under the same experimental setup. For methods originally designed for signal point-level tasks only [1, 9, 18, 25, 30], we use the mean of the anomaly localization results as their heartbeat-level or patient-level anomaly scores.

Fig. 2. Anomaly localization visualization on PTB-XL with different abnormal types. Ground truths are highlighted in red boxes on the ECG data, and anomaly localization results for each case, compared with the state-of-the-art method, are attached below. (Color figure online)

Anomaly Detection. The anomaly detection performance on PTB-XL is summarized in Table 1. The proposed method achieves 86.0% AUC in patient-level anomaly detection and outperforms all baselines by a large margin (10.3%). Table 2 displays the comparison results on MIT-BIH, where the proposed method achieves a heartbeat-level AUC of 96.9%, showing an improvement of 2.4% over the state-of-the-art BeatGAN (94.5%). Furthermore, the F1-score of the proposed method is 88.3%, which is 6.7% higher than BeatGAN (81.6%).

Anomaly Localization. Table 1 presents the results of anomaly localization on our proposed benchmark for multiple individuals. The proposed method achieves a signal point-level AUC of 74.7%, outperforming all baselines (3.2% higher than BeatGAN). It is worth noting that TSL, which is not designed for localization, shows poor performance in this task. Table 3 shows the signal point-level anomaly localization results for each independent individual on Keogh ECG. Overall, the proposed method achieves the best or second-best performance compared to other methods on six subsets and the highest mean AUC among all subsets (74.9%, 2.5% higher than BeatGAN), indicating its effectiveness. The proposed method shows a lower standard deviation (\(\pm 10.5\)) across the seven subsets compared to TranAD (\(\pm 11.3\)) and BeatGAN (\(\pm 11.0\)), which indicates good generalizability of the proposed method across different subsets.

Anomaly Localization Visualization. We present anomaly localization results on several samples from our proposed benchmark in Fig. 2, with ground truths annotated by experienced cardiologists. Regions with higher anomaly scores are shown in darker colors. Our method outperforms BeatGAN in accurately localizing various types of ECG anomalies, both periodic and episodic, such as incomplete right bundle branch block and premature beats. Although our method yields narrower localization regions than the ground truths, owing to its high sensitivity to abrupt changes in signal values, these regions still capture the areas that matter most for anomaly identification, as confirmed by experienced cardiologists.

Table 4. Ablation studies on the PTB-XL dataset. Factors under analysis are: masking and restoring (MR), multi-scale cross-attention (MC), the uncertainty loss function (UL), and the trend generation module (TGM). Results are shown as the patient-level AUC (%) averaged over five runs. The best-performing method is in bold.
Table 5. Sensitivity analysis w.r.t. the mask ratio on the PTB-XL dataset. Results are shown as the patient-level AUC averaged over five runs. The best-performing method is in bold, and the second-best is underlined.

3.2 Ablation Study and Sensitivity Analysis

Ablation studies were conducted on PTB-XL to confirm the effectiveness of the individual components of the proposed method. Table 4 shows that each module contributes positively to the overall performance of the framework. With none of the modules employed, the method reduces to an ECG reconstruction approach with a naive L2 loss and no cross-attention between multi-scale data. Individually adding the MR, MC, UL, and TGM modules to this baseline improves the AUC from 70.4% to 80.4%, 80.3%, 72.8%, and 71.2%, respectively, demonstrating the effectiveness of each module. Moreover, as the modules are added in sequence, the performance improves step by step from 70.4% to 86.0% AUC, highlighting the combined impact of all modules on the proposed framework.

We conduct a sensitivity analysis on the mask ratio, as shown in Table 5. Restoration with a 0% mask ratio can be regarded as reconstruction, which takes an entire sample as input and aims to reproduce it at the output. Results indicate that the model's performance first improves and then declines as the mask ratio increases from 0% to 70%. This trend reflects the fact that a low mask ratio limits the model's feature learning during restoration, while a high ratio makes it increasingly difficult to restore the masked regions. There is therefore a trade-off between maximizing the model's potential and keeping the restoration difficulty reasonable. The optimal mask ratio is 30%, which achieves the highest anomaly detection result (86.0% AUC).

4 Conclusion

This paper proposes a novel framework for ECG anomaly detection, where features of the entire ECG and local heartbeats are combined with a masking-restoration process to detect anomalies, simulating the diagnostic process of cardiologists. A challenging benchmark, with signal point-level annotations provided by experienced cardiologists, is proposed, facilitating future research in ECG anomaly localization. The proposed method outperforms state-of-the-art methods, highlighting its potential in real-world clinical diagnosis.