Abstract
Quantitative assessment of left ventricle (LV) function from cine MRI has significant diagnostic and prognostic value for patients with cardiovascular disease. The temporal movement of the LV provides essential information on the contraction and relaxation pattern of the heart, which clinical experts evaluate closely in practice. Inspired by the way experts view cine MRI, we propose a new CNN module that incorporates temporal information into LV segmentation from cine MRI. In the proposed CNN, the optical flow (OF) between neighboring frames is integrated and aggregated at the feature level, so that temporal coherence in cardiac motion is taken into account during segmentation. The proposed module is integrated into the U-net architecture without the need for additional training. Furthermore, dilated convolution is introduced to improve the spatial accuracy of segmentation. Trained and tested on the Cardiac Atlas database, the proposed network achieved a Dice index of 95% and an average perpendicular distance of 0.9 pixels for the middle LV contour, significantly outperforming the original U-net, which processes each frame individually. Notably, the proposed method improved the temporal coherence of the LV segmentation results, especially at the LV apex and base, where the cardiac motion is difficult to follow.
1 Introduction
1.1 Left Ventricle Segmentation
Cardiovascular disease is a major cause of mortality and morbidity worldwide. Accurate assessment of cardiac function is very important for the diagnosis and prognosis of patients with cardiovascular disease. Cine magnetic resonance imaging (MRI) is the current gold standard for assessing cardiac function [1], covering different imaging planes (around 10) and cardiac phases (ranging from 20 to 40).
The large total number of images (200–400) poses significant challenges for manual analysis in clinical practice. Computer-aided analysis of cine MRI has therefore been actively studied for decades. Most traditional methods in the literature are based on dedicated mathematical models of shape and intensity [2]. However, the substantial variations in cine images, including acquisition parameters, image quality, and heart morphology and pathology, make it very challenging, if not impossible, for traditional image analysis methods to reach a clinically acceptable balance of accuracy, robustness, and generalizability. As such, in current practice, the analysis of cine images still involves significant manual work, including contour tracing, or initialization and correction to aid semi-automated computer methods.
Recent developments in deep Convolutional Neural Networks (CNNs) have brought major improvements to many medical image analysis problems, including automated cine MRI analysis [3, 4]. In most CNN-based frameworks for cine MRI, nevertheless, the segmentation problem is still formulated as learning a label image from a given cine image, i.e. each frame is processed individually and there is no guarantee of temporal coherence in the segmentation results.
1.2 Our Motivation and Contribution
This is in contrast to what we have observed in clinical practice, where clinical experts always view cine MRI as a temporal sequence instead of individual frames, paying close attention to the temporally resolved motion of the heart. Inspired by the way experts view cine MRI, we aim to integrate temporal information to guide and regularize LV segmentation in an easily interpretable manner.
Between temporally neighboring frames, there are two types of useful information: (1) Difference: the relative movement of the object between neighboring frames, providing clues to object location and motion. (2) Similarity: sufficient coherence exists between temporally neighboring frames, because the temporal resolution of cine MRI is set to follow cardiac motion. In this work, we propose to use optical flow to extract the object location and motion information, while aggregating such information over a moving time window to enforce temporal coherence. Both difference and similarity measures are formulated into one module, named “optical flow feature aggregation sub-network”, which is integrated into the U-net architecture. Compared to the prevailing recurrent neural networks (RNNs) applied to temporal sequences [4], our method eliminates the need to introduce a large number of learnable RNN parameters, while preserving the simplicity and elegance of U-net. In relatively simple scenarios like cine MRI, our proposed method has high interpretability and low computation cost.
2 Method
2.1 Optical Flow in Cine MRI
Given two neighboring temporal frames in cine MRI, the optical flow field can be calculated to infer the horizontal and vertical motion of objects in the image [4], using the brightness constancy equation and the resulting constraint:
\[ I\left( x, y, t \right) = I\left( x + \Delta x, y + \Delta y, t + \Delta t \right) \]
\[ \frac{\partial I}{\partial x} V_{x} + \frac{\partial I}{\partial y} V_{y} + \frac{\partial I}{\partial t} = 0 \]
where \( V_{x} \), \( V_{y} \) are the velocity components of the pixel at location \( \left( x, y \right) \) in image \( I \). Because the LV is the major moving object in the field of view, the optical flow provides essential information on the location of the LV, as well as its mode of motion, as illustrated in Fig. 1, in which the background is clearly suppressed.
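To make the optical flow constraint concrete, the following is a minimal sketch that solves it in the least-squares sense for a single global translation between two frames. It is a deliberate simplification: the method described here uses a dense flow field (one vector per pixel), and the text does not state which optical flow algorithm was used. The function names `estimate_global_flow` and `gaussian_blob` are hypothetical and chosen only for illustration.

```python
import numpy as np

def estimate_global_flow(frame_a, frame_b):
    """Least-squares solution of Ix*Vx + Iy*Vy + It = 0 for one global (Vx, Vy)."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    # Spatial gradients of the mid-point image; np.gradient returns (d/dy, d/dx).
    grad_y, grad_x = np.gradient(0.5 * (a + b))
    # Temporal derivative for a unit time step between the two frames.
    grad_t = b - a
    # One constraint per pixel: [Ix, Iy] @ [Vx, Vy] = -It.
    A = np.stack([grad_x.ravel(), grad_y.ravel()], axis=1)
    v, *_ = np.linalg.lstsq(A, -grad_t.ravel(), rcond=None)
    return v[0], v[1]

def gaussian_blob(size, cx, cy, sigma=6.0):
    """Synthetic smooth object used as a stand-in for a moving structure."""
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))

frame_a = gaussian_blob(64, 32.0, 32.0)
frame_b = gaussian_blob(64, 33.0, 32.0)  # blob moved one pixel to the right
vx, vy = estimate_global_flow(frame_a, frame_b)
```

On this synthetic pair the recovered horizontal velocity is close to one pixel per frame and the vertical velocity is close to zero. A dense method applies the same constraint locally or with a smoothness term instead of globally.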
2.2 Optical Flow Feature Aggregation
We propose to integrate the optical flow into the feature maps, which are extracted by convolutional kernels:
\[ m_{j \to i} = I\left( m_{j}, O_{j \to i} \right) \]
where \( I\left( \cdot \right) \) is the bilinear interpolation function often used as a warp function in computer vision for motion compensation [5], \( m_{j} \) represents the feature maps of frame \( j \), \( O_{j \to i} \) is the optical flow field from frame \( j \) to frame \( i \), and \( m_{j \to i} \) represents the motion-compensated feature maps.
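The warp by bilinear interpolation can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `warp_features`, the `(C, H, W)` layout, the sampling direction (each output pixel reads the input at its own position plus the flow vector), and the clamping at the image border are all choices made for this example.

```python
import numpy as np

def warp_features(features, flow):
    """Bilinearly sample `features` (C, H, W) at positions displaced by `flow` (2, H, W).

    flow[0] is the horizontal (x) displacement and flow[1] the vertical (y)
    displacement in pixels: out[:, y, x] = features[:, y + flow[1], x + flow[0]].
    """
    c, h, w = features.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Sampling coordinates, clamped to the image border (assumed border handling).
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)
    # Integer corners surrounding each sampling position.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets act as the interpolation weights.
    fx = sx - x0
    fy = sy - y0
    top = features[:, y0, x0] * (1 - fx) + features[:, y0, x1] * fx
    bottom = features[:, y1, x0] * (1 - fx) + features[:, y1, x1] * fx
    return top * (1 - fy) + bottom * fy
```

With a zero flow field the function returns its input unchanged, and a uniform flow of one pixel in x shifts every channel by one column.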
We further aggregate the optical flow information over a longer time span in the cardiac cycle. The aggregated feature map \( \bar{m}_{i} \) is defined as follows:
\[ \bar{m}_{i} = \sum\limits_{j = i - k}^{i + k} w_{j \to i} \, m_{j \to i} \]
where \( k \) denotes the number of temporal frames before and after the target frame. A larger \( k \) gives a higher capability to follow temporal movement but a heavier computation load. We used \( k\, = \,2 \) as an empirical choice to balance computation load and capture range. The weight map \( w_{j \to i} \) measures the cosine similarity between feature maps \( m_{j} \) and \( m_{i} \) at all \( x \) and \( y \) locations, defined as:
\[ w_{j \to i} \left( x, y \right) = \frac{m_{j} \left( x, y \right) \cdot m_{i} \left( x, y \right)}{\left| m_{j} \left( x, y \right) \right| \left| m_{i} \left( x, y \right) \right|} \]
The feature map \( m_{i} \) and \( m_{j} \) contain all channels of features extracted by convolutional kernels (Fig. 2), which represent low-level information of the input image, such as location, intensity, and edge. Computed over all channels, \( w_{j \to i} \) describes local similarity between two temporally neighboring frames. By introducing the weighted feature map, we assign higher weights on locations with little temporal movement for coherent segmentation, while lower weights on locations with larger movement to allow changes.
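The weighted aggregation over the temporal window can be sketched as below. The sketch follows the description in the text (cosine similarity computed over all channels at each location, then a weighted sum of motion-compensated maps over \( 2k + 1 \) frames). Three points are assumptions for illustration only: the function names, the wrap-around of frame indices at the ends of the cardiac cycle, and the optional `normalize` flag that rescales the weights to sum to one at each pixel.

```python
import numpy as np

def cosine_weight(m_j, m_i, eps=1e-8):
    """Per-pixel cosine similarity across channels; inputs (C, H, W), output (H, W)."""
    dot = np.sum(m_j * m_i, axis=0)
    norm = np.linalg.norm(m_j, axis=0) * np.linalg.norm(m_i, axis=0)
    return dot / (norm + eps)

def aggregate(maps, warped, i, k=2, normalize=False):
    """Weighted sum of motion-compensated maps over frames i-k .. i+k.

    maps[j]   : feature maps m_j of frame j, shape (C, H, W)
    warped[j] : motion-compensated maps m_{j->i}, same shape
    """
    n = len(maps)
    # Wrap indices around the cardiac cycle (assumption, not stated in the text).
    js = [j % n for j in range(i - k, i + k + 1)]
    weights = np.stack([cosine_weight(maps[j], maps[i]) for j in js])  # (2k+1, H, W)
    if normalize:
        # Optional rescaling so the weights sum to one per pixel (assumption).
        weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)
    stacked = np.stack([warped[j] for j in js])        # (2k+1, C, H, W)
    return np.sum(weights[:, None] * stacked, axis=0)  # (C, H, W)
```

When all frames are identical, every weight is one, so the unnormalized result is \( 2k + 1 \) times the target feature map and the normalized result equals the target feature map.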
2.3 Optical Flow Net (OF-net)
The proposed optical flow feature aggregation is integrated into the U-net architecture, which we name optical flow net (OF-net). Compared to the original U-net, the OF-net has the following new characteristics:
Optical Flow Feature Aggregation Sub-network:
The first part of the contracting path is made of a sub-network of optical flow feature aggregation described in Sects. 2.1 and 2.2. With this sub-network embedded, the segmentation of an individual frame takes into consideration information from neighboring frames, both before and after it, and the aggregation acts as a “memory” as well as a prediction. The aggregated feature maps are then fed into the subsequent path, as shown in Fig. 2.
Dilated Convolution:
The max-pooling operation reduces the image size to enlarge the receptive field, causing a loss of resolution. Unlike in classification problems, resolution can be important for segmentation performance. To improve the LV segmentation accuracy, we propose to use dilated convolution [6] to replace part of the max-pooling operations. As illustrated in Fig. 3, dilated convolution enlarges the receptive field by spacing out the taps of the convolution kernel, which increases its effective size without reducing the image size. We replaced max-pooling with dilated convolution in 8 deep layers, as shown in Fig. 2.
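The effect of dilation on the receptive field can be illustrated with a short NumPy sketch. A kernel of size \( s \) with dilation \( d \) covers \( s + (s - 1)(d - 1) \) pixels per side, so a 3 × 3 kernel with dilation 2 covers 5 × 5 pixels while keeping 9 weights. The function names are hypothetical, and the example uses no padding and stride 1 for simplicity; it does not reproduce the layer configuration of the network.

```python
import numpy as np

def effective_kernel_size(kernel_size, dilation):
    """Side length covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def dilated_conv2d(image, kernel, dilation=1):
    """'Valid' 2-D cross-correlation with a dilated kernel (no padding, stride 1)."""
    kh, kw = kernel.shape
    eh = effective_kernel_size(kh, dilation)
    ew = effective_kernel_size(kw, dilation)
    out_h = image.shape[0] - eh + 1
    out_w = image.shape[1] - ew + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for a in range(kh):
        for b in range(kw):
            # Kernel taps are spaced `dilation` pixels apart in the input.
            patch = image[a * dilation : a * dilation + out_h,
                          b * dilation : b * dilation + out_w]
            out += kernel[a, b] * patch
    return out
```

For a 9 × 9 input and a 3 × 3 kernel, dilation 1 gives a 7 × 7 output and dilation 2 gives a 5 × 5 output, with the number of kernel weights unchanged.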
Res-Block:
To mitigate the vanishing gradient problem in deep CNNs, all blocks in the U-net (i.e. a convolutional layer, a batch normalization layer, and a ReLU unit) were replaced with res-blocks [7], as illustrated in Fig. 2.
The proposed OF-net preserves the U-shape architecture, and its training can be performed in the same way as for U-net without the need for joint training, as the optical flow between MRI frames only needs to be computed once. A simplified algorithm is summarized in Algorithm 1. \( N_{feature} \) and \( N_{segment} \) are the sub-networks for feature extraction and segmentation, respectively. \( P\left( \cdot \right) \) denotes the computation of optical flow.
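The per-frame procedure described in Sects. 2.1 to 2.3 can be sketched as follows. This is a minimal illustration of the data flow and is not a reproduction of Algorithm 1: the function `of_net_inference` and all of its callable parameters are hypothetical placeholders for \( N_{feature} \), \( N_{segment} \), \( P\left( \cdot \right) \), the warp function, and the weight map, and the wrap-around of frame indices over the cardiac cycle is an assumption.

```python
import numpy as np

def of_net_inference(frames, n_feature, n_segment, flow_fn, warp_fn, weight_fn, k=2):
    """Segment every frame using features aggregated from its 2k neighbors.

    n_feature(frame)      -> feature maps (C, H, W)   [stand-in for N_feature]
    n_segment(features)   -> segmentation output      [stand-in for N_segment]
    flow_fn(frame_j, frame_i) -> flow field from j to i  [stand-in for P]
    warp_fn(features, flow)   -> motion-compensated features
    weight_fn(m_j, m_i)       -> weight map (H, W)
    """
    n = len(frames)
    feats = [n_feature(f) for f in frames]
    outputs = []
    for i in range(n):
        agg = np.zeros_like(feats[i], dtype=np.float64)
        for offset in range(-k, k + 1):
            j = (i + offset) % n  # wrap around the cardiac cycle (assumption)
            flow = flow_fn(frames[j], frames[i])
            warped = warp_fn(feats[j], flow)
            agg += weight_fn(feats[j], feats[i]) * warped
        outputs.append(n_segment(agg))
    return outputs
```

Because the flow depends only on the input frames and not on any learnable parameter, it can be precomputed and cached per frame pair, which is consistent with the statement that it only needs to be computed once.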
3 Experiments and Results
3.1 Data and Ground Truth
Experiments were performed on the short-axis steady-state free precession (SSFP) cine MR images of 100 patients with coronary artery disease and prior myocardial infarction from the Cardiac Atlas database [8]. A large variability exists in the dataset: the MRI scanner systems included GE Medical Systems (Signa 1.5T), Philips Medical Systems (Achieva 1.5T, 3.0T, and Intera 1.5T), and Siemens (Avanto 1.5T, Espree 1.5T and Symphony 1.5T); image size varied from 138 × 192 to 512 × 512 pixels; and the number of frames per cardiac cycle ranged from 19 to 30.
Ground truth annotations of the LV myocardium and blood pool in every image were a consensus result of various raters including two fully-automated raters and three semi-automated raters demanding initial manual input. We randomly selected 66 subjects out of 100 for training (12,720 images) and the rest for testing (6,646 images). All cine MR and label images were cropped at the center to a size of 128 × 128. To suppress the variability in intensity range, each cine scan was normalized to a uniform signal intensity range of [0, 255]. Data augmentation was performed by random rotation within [−30°, 30°], resulting in 50,880 training images.
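The cropping and intensity normalization steps can be sketched as below. The function names are hypothetical, and min-max scaling per cine scan is an assumption: the text states only that each scan was normalized to the range [0, 255], not how. The crop assumes that both image dimensions are at least 128 pixels, which holds for the smallest reported size of 138 × 192.

```python
import numpy as np

def center_crop(image, size=128):
    """Crop the last two axes of `image` to size x size around the center."""
    h, w = image.shape[-2:]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[..., top:top + size, left:left + size]

def normalize_scan(scan, lo=0.0, hi=255.0):
    """Min-max scale a whole cine scan to [lo, hi] (assumed normalization scheme)."""
    scan = scan.astype(np.float64)
    smin, smax = scan.min(), scan.max()
    if smax == smin:
        # Constant scan: avoid division by zero.
        return np.full_like(scan, lo)
    return (scan - smin) / (smax - smin) * (hi - lo) + lo
```

Note that the reported 50,880 training images equal 4 × 12,720, which would be consistent with keeping each original image plus three rotated copies, although the text does not state the number of rotations per image.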
3.2 Network Parameters and Performance Evaluation
We used stochastic gradient descent optimization with an exponentially-decaying learning rate of \( 10^{ - 4} \) and a mini-batch size of 10. The number of epochs was 30. Using the same training parameters, 3 CNNs were trained: (1) the original U-net, (2) the OF-net with max-pooling, and (3) the OF-net with dilated convolution. The performance of LV segmentation was evaluated in terms of the Dice overlap index and the average perpendicular distance (APD) between the ground truth and the CNN segmentation results. Since LV segmentation is known to have different degrees of difficulty at the apex, middle, and base, we evaluated the performance in the three segments separately.
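For reference, the Dice overlap index between a predicted mask \( A \) and a ground truth mask \( B \) is \( 2\left| A \cap B \right| / \left( \left| A \right| + \left| B \right| \right) \). A minimal sketch for binary masks follows; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this example. The APD is not covered here because it requires contour extraction.

```python
import numpy as np

def dice_index(pred, truth):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```

For example, two 2 × 2 squares that share half of their pixels have a Dice index of 0.5.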
3.3 Results
The Dice and APD of the three CNNs are reported in Table 1. It can be seen that the proposed OF-net outperformed the original U-net at all segments of LV (p < 0.001), and with the dilated convolution introduced, the performance is further enhanced (p < 0.001).
Some examples of the LV segmentation results at the apex, middle, and base of the LV are shown in Fig. 4. It can be observed from (a)–(c) that the proposed method is able to detect a very small myocardium ring at the apex, which may be missed by the original U-net. From (g)–(i) it can be seen that the OF-net eliminates localization failure at the base. In the middle slices (d)–(f), the OF-net also produced smoother results than the original U-net, which processes each slice individually. The effect of integrating temporal information is better illustrated in Fig. 5, in which we plot the myocardium (upper panel) and blood pool (lower panel) areas, as determined by the resulting endocardial and epicardial contours, against the frame index in a cardiac cycle. It can be observed that the results produced by the OF-net are smoother and closer to the ground truth than those produced by U-net, showing improved temporal coherence of segmentation.
In Fig. 6, we illustrate how the aggregated feature map helps preserve temporal coherence: the 14th channel in the sub-network is a localizer of the LV. While the localization of the LV can be missed in one frame, the aggregated information from neighboring frames can correct for it and lead to coherent segmentation.
4 Conclusion
We have proposed an OF-net for fully automated segmentation of LV from cine MRI. The network integrates temporal information to imitate the expert way of viewing cine. Evaluated on the Cardiac Atlas database, the method outperformed the original U-net, producing more accurate and temporally-coherent LV segmentation.
References
1. de Roos, A., Higgins, C.B.: Cardiac radiology: centenary review. Radiology 273(2S), S142–S159 (2014)
2. Peng, P., Lekadir, K., Gooya, A., Shao, L., Petersen, S.E., Frangi, A.F.: A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging. MAGMA 29, 155–195 (2016)
3. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
4. Xue, W., Lum, A., Mercado, A., Landis, M., Warrington, J., Li, S.: Full quantification of left ventricle via deep multitask learning network respecting intra- and inter-task relatedness. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 276–284. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_32
5. Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: ICCV, pp. 408–417, October 2017
6. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR, May 2016
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778, June 2016
8. Fonseca, C.G., et al.: The cardiac atlas project–an imaging database for computational modeling and statistical atlases of the heart. Bioinformatics 27(16), 2288–2295 (2011)
Acknowledgements
This work was supported by National Key Research and Development Program of China (No. 2018YFC0116303).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Yan, W., Wang, Y., Li, Z., van der Geest, R.J., Tao, Q. (2018). Left Ventricle Segmentation via Optical-Flow-Net from Short-Axis Cine MRI: Preserving the Temporal Coherence of Cardiac Motion. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. MICCAI 2018. Lecture Notes in Computer Science(), vol 11073. Springer, Cham. https://doi.org/10.1007/978-3-030-00937-3_70
DOI: https://doi.org/10.1007/978-3-030-00937-3_70
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00936-6
Online ISBN: 978-3-030-00937-3