1 Introduction

According to the World Health Organization, millions of people worldwide suffer from heart-related disease, a major cause of mortality. The 2-D echocardiography (echo) examination is a widely used imaging modality for early diagnosis. Echo can be used to estimate several cardiac parameters such as stroke volume, end-diastolic volume, and ejection fraction [1]. These parameters are generally measured by identifying end-systolic and end-diastolic frames in cine echo [1, 2]. Current detection is either manual or semi-automatic [3, 4], relying primarily on the availability of electrocardiogram (ECG) data, which can be challenging in point-of-care and non-specialized clinics. Moreover, such detection techniques add to the workload of medical personnel and are subject to inter-user variability. Automatic detection of the cardiac cycle phase in echo can potentially alleviate these issues. However, such detection is challenging due to the low signal-to-noise ratio of echo data, the subtle differences between consecutive frames, the variability of cardiac phases, and the temporal dependencies between frames.

Fig. 1. Examples of end-diastolic and end-systolic frames in 2-D echocardiograms.

There have been a number of attempts at automatic detection and labeling of frames in echo [2, 4, 5]. The most common approach is to use a segmentation strategy, such as level-sets or graph-cuts, to identify the boundary of the left ventricle in an echo sequence [2, 5]. The boundaries corresponding to the largest and smallest ventricular areas are regarded as the end-diastolic (ED) and end-systolic (ES) frames [5] (Fig. 1). A major drawback of these segmentation-based approaches is the requirement for a good initialization and localization of the left ventricle. In another approach [4], information from cine frames was projected onto a low-dimensional 2-D manifold, after which the distances between points on the manifold were used to determine the ED and ES frames. However, this approach does not consider the complex temporal dependencies between frames, which may lead to sub-optimal results. Recently, methods based on convolutional neural networks (CNNs) combined with recurrent neural networks (RNNs) have produced state-of-the-art results in many computer vision problems [6,7,8,9]. Combinations of CNNs and RNNs have been successfully applied in various problem domains such as frame detection in videos, natural language processing, and object detection and tracking [3, 8, 10, 11]. However, our experience is that such network structures cannot be easily extended to the analysis of echo data, which suffer from a low signal-to-noise ratio and in which anatomical boundaries may not be as clearly visible as in other imaging modalities.

Here, we formulate a very deep CNN architecture using residual neural networks (ResNets) [7] to extract deep hierarchical features, which are then passed to an RNN containing two blocks of long short-term memory (LSTM) [12] to decode the temporal dependencies between these features. The primary motivation for combining ResNets with LSTMs is that the layers in ResNets are reformulated to learn a residual function with respect to the layer inputs, which counters the vanishing/exploding gradient problem as CNNs grow deeper [7]. In addition, we minimize a structured loss function that mimics the temporal changes in left ventricular volume during the systolic and diastolic phases [3]. We demonstrate that the proposed method, using ResNets and LSTMs with a structured loss function, produces state-of-the-art accuracy for detecting the cardiac cycle phase in echo data without the need for segmentation.

Fig. 2. The proposed method for characterisation of the cardiac cycle phase from echocardiograms using deep residual recurrent neural networks (RRNs), which consist of residual networks (ResNets), followed by two blocks of long short-term memory (LSTM) and a fully connected layer (FC).

2 Methods

2.1 Dataset

Let \(\mathcal{D} = \{ (\mathbf{c}^{(i)}, \mathbf{y}^{(i)})_j \}_{j=1}^{|\mathcal{D}|}\) represent the dataset, where \(i \in \{1,2,\ldots,N\}\) indexes the cardiac cycles of an individual study j, \(\mathbf{c} =\{\mathbf{x}_1, \mathbf{x}_2, \ldots,\mathbf{x}_t\}\) represents the collection of frames \(\mathbf{x}\) for each patient such that \(\mathbf{x}_t: \varOmega \rightarrow \mathbb{R}\) with \(\varOmega \subset \mathbb{R}^{2}\), and \(\mathbf{y}=\{y_1,y_2,\ldots,y_t\}\) denotes the class labels for individual frames, computed using the following function [3]:

$$\begin{aligned} y_t = {\left\{ \begin{array}{ll} \left( \dfrac{|t-T_\text{ES}^{i}|}{|T_\text{ES}^{i} -T_\text{ED}^{i}|}\right)^{\tau}, & \text{if } T_\text{ED}^{i}<t<T_\text{ES}^{i} \\[2ex] \left( \dfrac{|t-T_\text{ES}^{i}|}{|T_\text{ES}^{i}-T_\text{ED}^{i+1}|}\right)^{\frac{1}{\tau}}, & \text{if } T_\text{ES}^{i}<t<T_\text{ED}^{i+1} \end{array}\right.} \end{aligned}$$
(1)

where \(T_\text{ES}^{i}\) and \(T_\text{ED}^{i}\) are the locations of the ES and ED frames in the \(i^\text{th}\) cardiac cycle and \(\tau\) is an integer constant. The expression in Eq. 1 mimics the ventricular volume changes during the systolic and diastolic phases of a cardiac cycle [3].
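To make the labelling function concrete, the following is a minimal Python sketch that generates the frame labels of Eq. (1) for a single cardiac cycle; the function name, frame indexing, and default \(\tau\) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def cycle_labels(t_ed, t_es, t_ed_next, tau=3):
    """Labels per Eq. (1) for frames t_ed..t_ed_next of one cardiac cycle.

    t_ed, t_es, t_ed_next: frame indices of the current ED, ES, and next ED.
    Returns labels in [0, 1], where 0 marks ES and 1 marks ED.
    """
    labels = np.zeros(t_ed_next - t_ed + 1)
    for t in range(t_ed, t_ed_next + 1):
        if t <= t_es:
            # systole: label falls from 1 at ED to 0 at ES, shaped by tau
            labels[t - t_ed] = (abs(t - t_es) / abs(t_es - t_ed)) ** tau
        else:
            # diastole: label rises from 0 at ES back to 1 at the next ED
            labels[t - t_ed] = (abs(t - t_es) / abs(t_es - t_ed_next)) ** (1.0 / tau)
    return labels

# example: ED at frame 0, ES at frame 12, next ED at frame 29
print(cycle_labels(0, 12, 29, tau=3))
```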

Fig. 3. Two basic building blocks of our proposed method: (a) ResNet, and (b) LSTM. A typical ResNet model contains residual units, each consisting of BN, ReLU, and convolutional (weight) layers stacked together with an identity mapping (addition). An LSTM model is made of memory units (cells) that maintain a cell state using input, output, and forget gates.

2.2 Deep Residual Recurrent Neural Networks (RRNs)

We propose deep residual recurrent neural networks (RRNs) to decode the phase information in echo data. RRNs consist of a stack of residual units for extracting hierarchical spatial features from echo data, RNNs that use LSTMs for decoding the temporal information, and a fully connected layer at the end. As shown in Fig. 3(a), each residual unit is formed by adding a residual function to an identity mapping, and can be expressed in a general form by the following expression [7]:

$$\begin{aligned} \mathbf{x}_{L}^{(t)} = \mathbf{x}_{l}^{(t)} + \sum_{u=l}^{L-1} f_\text{RES}\big(\mathbf{x}_{u}^{(t)}; \mathcal{W}_{u}\big), \end{aligned}$$
(2)

where \(\mathbf{x}_l^{(t)}\) is the input feature to the \(l^\text{th}\) residual unit, \(l\in \{1,\ldots,L\}\) (for \(l=1\), \(\mathbf{x}_1^{(t)} = \mathbf{x}_t\)), \(\mathcal{W}_l = \{\mathbf{w}_{l,k}\}_{k=1}^{K}\) is the set of weights for the \(l^\text{th}\) residual unit, \(k\in \{1,\ldots,K\}\) indexes the layers in the residual unit, and \(f_\text{RES}(\cdot)\) is the residual function, represented by a convolutional (weight) layer [6, 13], batch normalization (BN) [14], and a rectified linear unit (ReLU) [7, 15] (Fig. 3(a)).
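As an illustration, a pre-activation residual unit in the style of [7] could be written as in the sketch below; the use of PyTorch, the module names, and the choice of two BN-ReLU-conv layers per unit are our own assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Pre-activation residual unit: x_out = x + f_RES(x; W),
    where f_RES is a stack of (BN -> ReLU -> conv) layers."""

    def __init__(self, channels, num_layers=2):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            ]
        self.f_res = nn.Sequential(*layers)

    def forward(self, x):
        # identity mapping plus the residual function
        return x + self.f_res(x)
```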

We use RNNs to decode the temporal dependencies within the echo sequence. An LSTM network, as shown in Fig. 3(b), is a type of RNN with the addition of a memory cell that allows the network to learn to remember or forget its hidden states. In particular, an LSTM updates its hidden state \(\mathbf{h}_t\) and memory cell \(\mathbf{c}_t\) given the sequential input \(\mathbf{x}_t\) using the following expressions [8]:

$$\begin{aligned} \begin{aligned} \mathbf{i}_t&= \phi\left(\mathbf{w}_{xi}\mathbf{x}_t + \mathbf{w}_{hi}\mathbf{h}_{t-1} + \mathbf{b}_i\right),\\ \mathbf{f}_t&= \phi\left(\mathbf{w}_{xf}\mathbf{x}_t + \mathbf{w}_{hf}\mathbf{h}_{t-1} + \mathbf{b}_f\right),\\ \mathbf{o}_t&= \phi\left(\mathbf{w}_{xo}\mathbf{x}_t + \mathbf{w}_{ho}\mathbf{h}_{t-1} + \mathbf{b}_o\right), \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} \mathbf{c}_t&= \mathbf{f}_t\odot \mathbf{c}_{t-1} + \mathbf{i}_t\odot \psi\left(\mathbf{w}_{xc}\mathbf{x}_t + \mathbf{w}_{hc}\mathbf{h}_{t-1} + \mathbf{b}_c\right),\\ \mathbf{h}_t&= \mathbf{o}_t\odot \psi(\mathbf{c}_{t}), \end{aligned} \end{aligned}$$
(4)

where \(\odot\) denotes the element-wise product, \(\psi\) is the hyperbolic tangent function, and \(\phi\) is the sigmoid function; \(\mathbf{w}_{xi}\), \(\mathbf{w}_{hi}\), and \(\mathbf{b}_i\) are the weights and biases of the input gate \(\mathbf{i}_t\); \(\mathbf{w}_{xf}\), \(\mathbf{w}_{hf}\), and \(\mathbf{b}_f\) are the weights and biases of the forget gate \(\mathbf{f}_t\); \(\mathbf{w}_{xo}\), \(\mathbf{w}_{ho}\), and \(\mathbf{b}_o\) are the weights and biases of the output gate \(\mathbf{o}_t\); and \(\mathbf{w}_{xc}\), \(\mathbf{w}_{hc}\), and \(\mathbf{b}_c\) are the weights and biases used to compute the candidate memory content. Each gate combines the current input \(\mathbf{x}_t\) with the previous hidden state \(\mathbf{h}_{t-1}\). Our proposed RRN (Fig. 2) can be expressed as the following function:

$$\begin{aligned} \left\{\tilde{y}_1, \tilde{y}_2,\ldots,\tilde{y}_t\right\} = f_\text{RRNs}\left(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_t;\mathcal{W}_\text{RRNs}\right), \end{aligned}$$
(5)

where \(f_\text{RRNs}\) takes the sequential echo images as input and outputs the predicted values \(\tilde{y}_t\), with the parameters of the RRN given by \(\mathcal{W}_\text{RRNs}=[\mathcal{W}_\text{ResNets},\mathcal{W}_\text{LSTM}, \mathbf{w}_\text{fc}]\). \(\mathcal{W}_\text{ResNets}\) represents the parameters of the ResNets, \(\mathcal{W}_\text{LSTM}\) represents the parameters of the LSTMs, and \(\mathbf{w}_\text{fc}\) contains the weights of the final fully connected layer. We train the proposed RRN in an end-to-end fashion with stochastic gradient descent to minimise the following loss function [3], whose first term is an \(L^2\) norm and whose second term is a structured loss (for notational simplicity, we drop the index i denoting the cardiac cycles):

$$\begin{aligned} \begin{aligned} \ell(\mathcal{W}_\text{RRNs}) = \sum_{j=1}^{|\mathcal{D}|}\bigg[\frac{\alpha}{T}\sum_{t=1}^{T}||y_{(j,t)}-\tilde{y}_{(j,t)}||^2 + \frac{2\beta}{T} \sum_{t=2}^{T}\big(\mathbb{I}(y_{(j,t)}>y_{(j,t-1)}) \\ \max(0,\tilde{y}_{(j,t-1)}-\tilde{y}_{(j,t)})+\mathbb{I}(y_{(j,t)}<y_{(j,t-1)})\max(0,\tilde{y}_{(j,t)}-\tilde{y}_{(j,t-1)})\big)\bigg], \end{aligned} \end{aligned}$$
(6)

where \(\mathbb{I}\) denotes the indicator function, T is the maximum frame length, and \(\alpha, \beta\) are user-defined weights that are cross-validated during the experiments. The structured loss term accounts for the monotonically decreasing nature of the ventricular volume during the systolic phase and its monotonically increasing nature during the diastolic phase. Finally, inference in the proposed method is done in a feed-forward manner using the learned model.
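A hedged PyTorch sketch of the loss in Eq. (6) is given below, combining the \(L^2\) term with the structured monotonicity term; the function signature, batching, and reduction over the batch are assumptions made for illustration.

```python
import torch

def rrn_loss(y_true, y_pred, alpha=1.0, beta=1.0):
    """Loss of Eq. (6) for one batch of sequences.

    y_true, y_pred: tensors of shape (batch, T) with labels in [0, 1].
    alpha, beta: weighting factors, cross-validated in practice.
    """
    T = y_true.shape[1]
    # L2 term: squared error averaged over the T frames of each sequence
    l2 = (alpha / T) * ((y_true - y_pred) ** 2).sum(dim=1)

    # structured term: penalise predictions that move against the
    # monotone trend of the ground-truth labels between consecutive frames
    dy_true = y_true[:, 1:] - y_true[:, :-1]
    dy_pred = y_pred[:, 1:] - y_pred[:, :-1]
    rising = (dy_true > 0).float() * torch.clamp(-dy_pred, min=0.0)
    falling = (dy_true < 0).float() * torch.clamp(dy_pred, min=0.0)
    structured = (2.0 * beta / T) * (rising + falling).sum(dim=1)

    # sum over the sequences in the batch, as in the outer sum of Eq. (6)
    return (l2 + structured).sum()
```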

3 Experiments

We carried out the experiments on a set of 2-D apical four-chamber (AP4) cine echoes obtained at Vancouver General Hospital (VGH) with ethics approval from the Clinical Medical Research Ethics Board of Vancouver Coastal Health (VCH) (H13-02370). The dataset consists of 1,868 individual patient studies covering both normal cases and a range of pathological conditions, acquired using different types of ultrasound probes from various manufacturers. Experiments were run by randomly dividing these cases into mutually exclusive subsets, with 60% of the cases used for training, 20% for validation, and 20% for testing. The average length of the echo sequence in each study was about 30 frames. Electrocardiogram (ECG) signals, synchronized with the cine clips, were also available. We used the ECG signal to generate the ground-truth labels for the ED and ES frames by identifying the R and end-of-T points in the signal. The intermediate labels for all frames in each echo sequence were then derived using Eq. (1), where \(\tau\) was set to 3 and the labels were normalized to the range [0, 1] (0 represents ES and 1 represents ED). The experiments were conducted at an image resolution of \(120\times 120\) pixels, obtained by sub-sampling the original images using bi-cubic interpolation.

Our proposed RRN, shown in Fig. 2, takes as input a cardiac sequence containing 30 frames, irrespective of where the sequence starts in the cardiac cycle. Each frame is passed through a convolutional (weight) layer plus a ReLU, where the convolutional layer contains eight filters of size \(3\times 3\), followed by nine residual units. Each residual unit is made up of batch normalization (BN), ReLU, and weight layers. Each convolutional layer in the first three residual units contains eight filters of size \(3\times 3\); the fourth, fifth, and sixth residual units contain 16 filters of size \(3\times 3\); and the seventh, eighth, and ninth units contain 32 filters of size \(3\times 3\). In the second-to-last layer, we concatenate the 32 output features from each ResNet to form 192 features (\(32\times 6\)), followed by two blocks of LSTM layers and a final fully connected layer containing 30 neurons, which generates a label for each frame of the cardiac cycle. For comparison, we also used the shallow Zeiler-Fergus (ZF) CNN model [3] in combination with different loss functions. The performance of our approach is evaluated with three metrics: the coefficient of determination (\(R^2\) score), the average absolute frame detection error, and the computation time per sample. The \(R^2\) score is defined as \(R^2 = 1-\frac{\sum_t (y_t-\tilde{y}_t)^2}{\sum_t (y_t-\bar{y})^2}\), where \(\bar{y}\) is the mean of the true labels. The average absolute frame detection error is calculated as \(Err_S= \frac{1}{|\mathcal{D}|}\sum |T_\text{S} -\widetilde{T}_\text{S}|\), where S is either the ED or the ES frame. All experiments were performed on a system with an Intel(R) Core(TM) i7-2600k 3.40 GHz \(\times\) 8 CPU, 8 GB of RAM, and an NVIDIA GeForce GTX 980Ti graphics card.
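For concreteness, a PyTorch-style sketch of the overall RRN described above is given below, reusing the ResidualUnit sketched in Sect. 2.2. The framework, the stride-2 transition convolutions between filter stages, the global average pooling to a 32-dimensional per-frame feature, the LSTM hidden size, and the per-frame linear output (in place of the 30-neuron fully connected layer) are all assumptions introduced where the text does not specify the details.

```python
import torch
import torch.nn as nn
# ResidualUnit: the pre-activation block sketched in Sect. 2.2

class RRN(nn.Module):
    """Sketch of the RRN: per-frame ResNet features -> 2-layer LSTM -> per-frame label."""

    def __init__(self, lstm_hidden=128):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(inplace=True))
        stages, in_ch = [], 8
        for ch in (8, 16, 32):
            # assumed transition conv: change channel count and halve resolution
            stages.append(nn.Conv2d(in_ch, ch, 3, stride=2, padding=1))
            stages += [ResidualUnit(ch) for _ in range(3)]   # three residual units per stage
            in_ch = ch
        self.res_stages = nn.Sequential(*stages)
        self.pool = nn.AdaptiveAvgPool2d(1)                   # assumed pooling to a 32-d vector
        self.lstm = nn.LSTM(32, lstm_hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(lstm_hidden, 1)                   # one label per frame (assumed)

    def forward(self, frames):
        # frames: (batch, 30, 1, 120, 120)
        b, t = frames.shape[:2]
        feats = self.stem(frames.flatten(0, 1))
        feats = self.pool(self.res_stages(feats)).flatten(1)  # (b*t, 32)
        out, _ = self.lstm(feats.view(b, t, -1))               # (b, t, hidden)
        return torch.sigmoid(self.fc(out)).squeeze(-1)         # labels in [0, 1]
```

A forward pass on a tensor of shape (batch, 30, 1, 120, 120) then yields one predicted label per frame, which can be trained against the labels of Eq. (1) using the loss sketched in Sect. 2.2.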

4 Results and Discussion

Table 1 shows the results of the proposed approach using two different loss functions: (a) the \(L_2\) loss, and (b) the \(L_2\) loss + structured loss [3]. The \(R^2\) values for these two settings are 0.36 and 0.66, respectively. Similarly, (\(Err_\text{ED}\), \(Err_\text{ES}\)) for these two settings are (4.4, 4.7) and (3.7, 4.1), respectively. The baseline CNN method [3] with the same settings produces an \(R^2\) score of 0.13, with average absolute frame detection errors (\(Err_\text{ED}\), \(Err_\text{ES}\)) of (6.3, 7.3) in setting (a) and (6.4, 7.3) in setting (b).

Results in Table 1 show that the proposed RRN method outperforms the baseline method by a large margin in both settings. The advantage of the proposed method lies in its ability to go deeper, with 126 layers in the ResNets compared to the shallow architecture of the baseline model (CNN). Furthermore, we note that the performance of the proposed RRN model improves with the use of the structured loss, yielding a higher \(R^2\) score and lower frame detection errors compared to the \(L_2\) loss alone. This is due to the fact that the structured term introduces a monotonicity constraint in the loss function that helps smooth the predicted labels, thus increasing the overall accuracy. It is also important to note that the \(R^2\) score of our proposed model is far better than that of the baseline model (0.66 vs. 0.13), indicating that our approach is a better function approximation in the global context (i.e., with respect to assigning the correct label to individual frames). The method takes only 80 ms to characterise each cardiac cycle, showing its efficiency in terms of computational cost. Figure 4 shows some visual results of frame characterisation in a sequence of 30 frames as a function of the labels defined in Eq. (1). In particular, Fig. 4(a, b) shows cases where the first frame in the sequence is an ED frame, and Fig. 4(c, d) shows cases where the location of the ED frame is shifted. Similarly, Fig. 5 shows one fairly accurate visual example of the detection of ED and ES frames by the proposed method on the test set.

Table 1. Performance of the proposed and state-of-the-art methods on the test set.
Fig. 4. Label approximation of the proposed method (blue) plotted against the ground-truth labels (red) on some sample cases, where the ES and ED frames (denoted by rectangular boxes) are correctly identified within an error margin of 3 to 4 frames. (Color figure online)

Fig. 5. A sample echo cine, consisting of 30 frames, alongside its ECG signal. The results of the proposed method in terms of ED and ES frame detection are shown and compared with the ground-truth labels.

5 Conclusion and Future Works

In this paper, we proposed a deep residual recurrent neural network (RRN) for the automated identification of cardiac cycle phases (ED and ES) from echocardiograms. We showed that the proposed method outperforms a baseline CNN method by a large margin in detecting ED and ES frames, achieving an \(R^2\) score of 0.66 and average absolute frame detection errors of 3.7 and 4.1 frames for ED and ES, respectively. Our results suggest that the method has the potential to be used in a clinical setting and is robust across a wide range of pathological conditions. In future work, we plan to use the method as a pre-processing step for assessing several cardiac function parameters, including the ejection fraction.