Introduction

Cardiac ultrasound is among the most widely used imaging modalities for studying the heart. 2D echocardiography (echo) is the basis for acquiring a broad range of diagnostic measurements for the evaluation of cardiac structure and function, such as cardiac output, left ventricular ejection fraction (LVEF), and diastolic function. Compared to magnetic resonance imaging (MRI) and computed tomography (CT), ultrasound imaging is less costly, non-ionizing, and available to a wider range of patients, such as those with implanted cardiac devices.

As technology advances, the traditional cart-based or laptop-sized ultrasound machines are being replaced by portable point-of-care ultrasound (POCUS) devices, such as the Philips Lumify, Clarius, and Butterfly iQ. These are often packaged as an ultrasound probe with a wireless or wired connection to a mobile device. This trend can be ascribed to the cost-effectiveness and ease of access of POCUS devices. Specifically, in an emergency or critical care scenario, a portable device enables clinicians to perform agile preliminary exams on patients and proceed with time-critical and potentially life-saving diagnostic decisions. Recent studies [9, 20, 21] also show that POCUS is beneficial to anesthesia practices.

Medical image analysis has enjoyed significant progress in recent years, specifically with the emergence of deep learning techniques [15, 32, 34]. A comprehensive survey of deep learning-based medical image analysis can be found in [17]. In particular, there is a rich collection of literature on ultrasound image analysis, given the large demand for it in clinical applications, such as using deep convolutional neural networks (CNN) and recurrent neural networks (RNN) to locate fetal standard imaging planes in cine clips [5, 6]; using a CNN for echo image quality estimation [1]; and using neural networks to generate text descriptions of valvular diseases from Doppler images [22]. Furthermore, on 3D ultrasound, Ghesu et al. [10] use shallow and deep sparse networks to detect and segment the aortic valve. Finally, a CNN-based system for fully automated cardiac structure and function determination and disease detection can be found in [35, 36].

LVEF is a key metric to assess cardiac function and can be derived from apical two-chamber (AP2) and apical four-chamber (AP4) 2D echo exams. On a conventional ultrasound machine, in order to perform an accurate LVEF calculation, the sonographer is required to capture a high-quality cine series that includes the end-diastolic (ED) and end-systolic (ES) frames of a cardiac cycle. With the assistance of the manufacturer’s built-in software, the sonographer manually provides the main axis of the LV and traces its boundary in the ED and ES frames. Subsequently, the LV volumes at the ED and ES phases are calculated to compute the LVEF ratio.

An early attempt at LV segmentation in 2D echo used Deep Belief Networks (DBN) to improve the robustness of a model trained on a small dataset [4]. This method was extended in [3] by combining DBN and dynamic models for LV tracking. Nascimento et al. [23] proposed to combine manifold learning with DBN for the multi-atlas LV segmentation problem. Chen et al. [7] proposed a multi-view regularized Fully Convolutional Network (FCN) [18] model for improving LV segmentation in echo images. In [26], anatomical shapes learned through a T-L network [11] were used to regularize the training of deep learning networks for LV segmentation in both 3D ultrasound and MR images. The U-net [29] architecture, a popular method for medical image segmentation, has also been applied to LV segmentation in 2D echo [33]. A number of research groups aim to address LV segmentation in cardiac MR and CT images, where the techniques are often adaptable to echo images. Avendi et al. [2] combined CNN, stacked auto-encoders (AE), and deformable models to handle automatic LV segmentation in short-axis cardiac MR slices, where the CNN was used to localize the LV and the stacked AE was used to infer the LV shape. A two-stage strategy was utilized by Zreik et al. [37] to segment the LV in 3D cardiac CT volumes, where a 3D LV bounding box was determined by aggregating the predictions of three 2D CNNs, and a voxel-classification CNN was used to segment the LV within the bounding box. Ngo et al. [24] proposed to use one DBN for LV localization and another to perform initial LV segmentation, where a level-set method was introduced to refine the segmentation. Poudel et al. [27] incorporated a recurrent connection within the U-net architecture to segment the LV from short-axis cardiac MR slices. Patch-based CNNs were integrated into an active contour framework for LV boundary extraction by Rupprecht et al. [30].

Fig. 1

Block diagram of the proposed mobile system for real-time LV segmentation, landmark detection, and biplane ejection fraction estimation

In comparison with a conventional ultrasound machine, a mobile POCUS device has limited processing power, memory, and storage space for executing live diagnostic software. In addition, while semi-automated LV segmentation systems can improve the accuracy of LV segmentation [24], it is impractical to manually trace or correct LV borders on the screen of a hand-held device. To address the above issues, in this work, we aim to develop an integrated mobile application that provides a computationally efficient, automated, and accurate LVEF estimation without the need for user intervention.

The proposed method calculates LVEF using the biplane Simpson’s method [16, 31], a standard method to calculate the LV volume v at the ED and ES phases:

$$\begin{aligned} v = \frac{\pi }{4} \sum _{i=1}^{n} a_i b_i\frac{L}{n}, \end{aligned}$$
(1)

where L represents the length of the ventricular cavity (i.e., the long axis of the LV, measured as the distance from the LV apex to the middle of the mitral valve annular plane), and \(a_i\) and \(b_i\) are the diameters of the n equal-height cylinders obtained by dividing the LV along L into n equal sections. Note that L, a, and b are measured from the LV segmentation maps detected in the perpendicular AP2 and AP4 echo views [31]. In order to achieve an accurate LVEF estimation, the apex and middle mitral valve plane landmarks (denoted as LV landmarks) must be accurately located in addition to obtaining an accurate LV segmentation. Therefore, we propose a novel LVEF estimation framework that uses a multi-task deep learning network to simultaneously solve the LV segmentation and LV landmark detection problems.

To fit the application within the memory and computational constraints of a POCUS system, we implement the proposed multi-task approach as a lightweight model. In general, a lightweight network (i.e., a shallower and/or slimmer network) does not perform as well as its deeper and wider counterparts. To alleviate this issue, we adopt an adversarial training mechanism [12, 19, 28] to correct higher-order inconsistencies between the expert ground truth LV segmentation maps and the prediction maps produced by the network. Adversarial training regularizes the network parameters, reducing over-fitting and hence improving validation accuracy [19].

To summarize, our contributions in this work are:

  • the proposed framework is the first automated pipeline for LVEF estimation using POCUS mobile devices and biplane Simpson’s method;

  • the proposed segmentation network is implemented as a lightweight multi-task network with the performance enhanced by adversarial training.

The block diagram of the proposed system is shown in Fig. 1. A cardiac POCUS device captures echo frames live from the patient and transmits the images to the mobile application. The system input can be provided by an ultra-portable hand-held ultrasound probe, or by a conventional cardiac ultrasound machine through a frame grabber. The operator captures the AP2 and AP4 echo views, which are the standard echo planes to study the LV. The mobile application tracks the LV region in the captured frames to calculate the LVEF. A deep learning-based segmentation network acts as the core intelligence of the mobile application. The segmentation network is trained offline on archived echo data to simultaneously segment the LV area and detect the two LV landmark points. The detected regions are used in the pipeline to estimate LVEF based on the aforementioned Simpson’s method [31].

In Sect. 2, the details of the system workflow and the implemented mobile application are explained. The proposed multi-task segmentation and landmark detection methodology is discussed in Sect. 3.

Mobile application

Software pipeline

Figure 1 shows the data flow pipeline of the software. The application can be set to accept three different sources of input: bitmaps saved on the Android device, live frames from the Clarius probe streamed over a wireless connection, or live frames from a cart-based ultrasound machine streamed through a frame grabber.

Fig. 2

Sample detected LV segmentation and LV landmark points from the AP4 view. Left is the heart at the end-diastolic (ED) phase and right is the end-systolic (ES) frame. The volume ratio between the ED and ES frames, calculated using the segmentation area and the landmark points, is used to calculate LVEF

The simplest way to receive input data is by storing datasets directly on the device’s internal memory and then loading them frame by frame at a pre-specified frame rate. Alternatively, the wireless Clarius ultrasound probe can be used to stream live data over a wireless network. Finally, the device can also accept serial input through its USB-C port: we use an Epiphan AV.IO frame grabber to capture and convert the output from the DVI port of any cart-based ultrasound machine, and pipe it directly into the Android device using a standard USB-C connection. When using this modality, we crop the raw frame-grabbed data so as to only include the ultrasound beam, the boundaries of which are set by the user once for each cart-based system.

Once properly connected, the application converts each full-resolution ultrasound frame to a bitmap and displays it to the user in the application’s Graphical User Interface (GUI). After the user initiates the segmentation option, the application down-samples the raw frame data to the input dimensions of the neural network (128\(\times \)128 pixels in our implementation). The resized frames are then sent to one of four concurrently running instances of the TensorFlow Mobile Java inference engine. Each of these instances loads and runs the resized frames through the segmentation network, the design and training of which are described in Sect. 3. There are two outputs from the segmentation network: the segmentation and the landmarks, shown in Fig. 2. The segmentation output (green) is a 128\(\times \)128 binary mask. The landmarks output (orange) is also a binary mask, this time containing two blobs, one representing the most likely location of the LV apex and the other the mitral valve. The network outputs are then resized back up to the original frame dimensions, overlaid onto the original bitmap, and displayed in the application GUI. The outputs are also used to calculate LVEF, as described in Sect. 2.2.
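To make this per-frame flow concrete, the following is a minimal Python sketch of the same logic (the actual application implements it in Java with the TensorFlow Mobile inference engine): down-sample the incoming frame to the network input size, run the frozen multi-task model, and resize both output masks back to the frame resolution for overlay. The `seg_model` callable and the 0.5 threshold are illustrative assumptions, not the application's code.

```python
# Illustrative sketch of the per-frame inference flow; the real app does this
# in Java with TensorFlow Mobile. `seg_model` is a placeholder for the trained
# multi-task network returning (LV mask, landmark mask) probability maps.
import numpy as np
import cv2

def process_frame(frame_gray: np.ndarray, seg_model, net_size=(128, 128)):
    h, w = frame_gray.shape
    x = cv2.resize(frame_gray, net_size).astype(np.float32) / 255.0
    lv_prob, lm_prob = seg_model(x[None, ..., None])                     # add batch and channel dims
    lv_mask = cv2.resize(np.asarray(lv_prob)[0, ..., 0], (w, h)) > 0.5   # green overlay
    lm_mask = cv2.resize(np.asarray(lm_prob)[0, ..., 0], (w, h)) > 0.5   # orange overlay
    return lv_mask, lm_mask                                              # also forwarded to the EF Calculator
```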

Since the system runs in a resource-limited environment (i.e., on a mobile phone), a concerted effort had to be made to achieve the desired frame rate of 30 Hz with minimal latency and no frame drops. The largest bottleneck along the data pipeline is the time it takes to run the segmentation network. We tested several networks of increasing model size in order to determine their suitability for mobile deployment. The run-time statistics are shown in Table 1. The number of base filters refers to the number of filters in the first layer of the U-net, which is doubled after each down-sampling step.

Regardless of the network used, we need to run multiple segmentation network runners (SEGs) concurrently in order to achieve a per-frame processing time of \(1/(30\;\hbox {Hz}) = 33.3\) ms. In order to prevent the application from lagging, the SEGs must finish their execution before they are fed their next frame, i.e., all the per-frame processing must be completed within \(T_{\mathrm {max}}\), calculated as follows:

$$\begin{aligned} T_{\mathrm {max}} = \frac{\mathrm {\#~of~SEGs}}{\mathrm {FPS}} > \mu _{\mathrm {bs}} + 2 \sigma _{\mathrm {bs}} . \end{aligned}$$
(2)
Table 1 Mean and standard deviation for the per-frame run times of the segmentation networks with different sizes

In practice, we found that requiring the mean run time to be at least two standard deviations less than \(T_{\mathrm {max}}\) is sufficient to prevent any noticeable lag during the program’s execution. Using the data from Table 1 in Eq. (2), we determine that the minimum required number of concurrently running SEGs is four for a base filter of four, eight for a base filter of eight, and 27 for a base filter of 16. Each SEG instance requires roughly 15 MB of RAM and roughly 10% CPU usage. Additionally, the system’s latency is hard-capped by the run time of the network. For these reasons, we chose the smallest of the tested networks, using a base filter of four.
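As a concrete check of Eq. (2), the sketch below computes the smallest number of concurrent SEG instances whose shared time budget \(T_{\mathrm {max}}\) exceeds \(\mu _{\mathrm {bs}} + 2\sigma _{\mathrm {bs}}\); the run-time numbers in the example call are hypothetical, not the values from Table 1.

```python
# Minimal sketch of the scheduling rule in Eq. (2): find the smallest number of
# concurrent SEG runners n such that n / FPS exceeds mu_bs + 2 * sigma_bs.
import math

def min_concurrent_segs(mu_bs_ms: float, sigma_bs_ms: float, fps: float = 30.0) -> int:
    budget_needed_ms = mu_bs_ms + 2.0 * sigma_bs_ms    # per-frame processing must fit in this window
    return math.ceil(budget_needed_ms * fps / 1000.0)  # n / fps seconds = 1000 * n / fps milliseconds

# Hypothetical run-time statistics, for illustration only:
print(min_concurrent_segs(mu_bs_ms=110.0, sigma_bs_ms=10.0))  # -> 4
```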

Ejection fraction calculation

After each run, the SEG threads send their outputs to the EF Calculator class through asynchronous callbacks, as shown in Fig. 1. The segmentation maps and landmarks are then buffered until there is enough data to find the ED and ES frames of the cardiac cycle. This is done by simply finding the maximum and minimum areas of the buffered LV segmentations, corresponding to the ED and ES frames, respectively. A 60-frame buffer is used to capture the entirety of any heart cycle above 30 bpm. The landmarks of these two frames are then used to calculate \(L_{\mathrm {ED}}\) and \(L_{\mathrm {ES}}\), i.e., the respective long-axis length of the LV measured from the apex to the middle of the mitral valve. This is done by finding the two largest connected components present in the landmark prediction map, finding the coordinates of their centers of mass (CoMs), and calculating the Euclidean distance between them, in pixels. The L measurement can be converted from pixels to centimeters by dividing it by the pixel resolution, while the segmentation area A can be converted to square centimeters by dividing it by the pixel density. Note that knowledge of the ultrasound imaging depth and pixel spacing is required to make these conversions. The single-plane LV volume can then be estimated as:

$$\begin{aligned} v_{s} = 0.85 \frac{A^2}{L}. \end{aligned}$$
(3)

Using Eq. 3 for estimating LV volumes in both ED and ES frames, we can calculate LVEF as:

$$\begin{aligned} e = \frac{v_{s}^{\mathrm {ED}} - v_{s}^{\mathrm {ES}}}{v_{s}^{\mathrm {ED}}}. \end{aligned}$$
(4)
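A minimal sketch of this single-plane calculation is given below, assuming the buffered network outputs are available as NumPy arrays (`segs` as binary LV masks, `lms_list` as binary landmark masks) and that the pixel resolution `r_px_per_cm` is known from the imaging depth; the names and structure are illustrative, not the app's EF Calculator class.

```python
# Sketch of the single-plane EF calculation (Eqs. 3 and 4) from buffered frames.
import numpy as np
from scipy import ndimage

def landmark_distance_px(lms: np.ndarray) -> float:
    """Distance in pixels between the centers of mass of the two largest blobs."""
    labels, n = ndimage.label(lms)
    sizes = ndimage.sum(lms, labels, range(1, n + 1))
    top2 = np.argsort(sizes)[-2:] + 1                            # labels of the two largest blobs
    (y1, x1), (y2, x2) = ndimage.center_of_mass(lms, labels, top2)
    return float(np.hypot(y1 - y2, x1 - x2))

def single_plane_ef(segs, lms_list, r_px_per_cm: float) -> float:
    areas = [s.sum() for s in segs]                               # LV area per buffered frame, in px
    ed, es = int(np.argmax(areas)), int(np.argmin(areas))         # ED = max area, ES = min area

    def volume(idx):
        A_cm2 = areas[idx] / r_px_per_cm ** 2                     # px -> cm^2
        L_cm = landmark_distance_px(lms_list[idx]) / r_px_per_cm  # px -> cm
        return 0.85 * A_cm2 ** 2 / L_cm                           # Eq. (3), area-length volume

    v_ed, v_es = volume(ed), volume(es)
    return (v_ed - v_es) / v_ed                                   # Eq. (4)
```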

The single-plane volume calculation shown in Eq. 3 can be performed using data from either the AP4 or the AP2 view; however, we can produce a more accurate 3D volume estimation by considering both cross-sections simultaneously. Once we have captured and buffered frames from both views, we can calculate the biplane volume for both the ED and ES frames using an adaptation of Simpson’s disk-counting method [31]. First, we rotate the ED and ES frames from both the AP4 and AP2 views such that their L’s are vertically aligned. We then scale the frames such that \(L_{\mathrm {AP4}}^{\mathrm {ED}} = L_{\mathrm {AP2}}^{\mathrm {ED}}\) and \(L_{\mathrm {AP4}}^{\mathrm {ES}} = L_{\mathrm {AP2}}^{\mathrm {ES}}\), since, although the AP4 and AP2 images may differ in scale, the underlying anatomy is the same. Once properly rotated and scaled, we can apply a variant of Eq. 1, summing over the pixel length of L:

$$\begin{aligned} v_{b}= & {} \frac{\pi }{4} \sum _{i=1}^{L_{\mathrm {px}}} a_{(i,\mathrm {cm})}~b_{(i,\mathrm {cm})} \frac{L_{\mathrm {cm}}}{L_{\mathrm {px}}} = \frac{\pi }{4} \sum _{i=1}^{L_{\mathrm {px}}} \frac{a_{(i,\mathrm {px})}}{r} \frac{b_{(i,\mathrm {px})}}{r} \frac{1}{r} \nonumber \\= & {} \frac{\pi }{4} \sum _{i=1}^{L_{\mathrm {px}}}\frac{a_{(i,\mathrm {px})}~b_{(i,\mathrm {px})}}{r^3}, \end{aligned}$$
(5)

where \(a_{(i,\mathrm {px})}\) is the pixel width of the i-th horizontal line of the LV in the AP4 image, \(b_{(i,\mathrm {px})}\) is the corresponding width in the AP2 image, and r is the pixel resolution of the image. By running this calculation for both pairs of ED and ES frames, we can refine the EF estimate of Eq. 4 by using the more accurate biplane LV volume estimation.
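Under the same assumptions, the disk summation of Eq. (5) can be sketched as follows, taking the already-rotated and rescaled binary LV masks of the two views for one cardiac phase; the variable names are illustrative.

```python
# Sketch of the biplane disk summation in Eq. (5): masks are assumed rotated so
# the long axis L is vertical and scaled to the same pixel length of L.
import numpy as np

def biplane_volume_cm3(mask_ap4: np.ndarray, mask_ap2: np.ndarray, r_px_per_cm: float) -> float:
    """Sum pi/4 * a_i * b_i / r^3 over the horizontal lines spanned by the LV."""
    a_px = mask_ap4.sum(axis=1).astype(float)   # LV width of each horizontal line, AP4
    b_px = mask_ap2.sum(axis=1).astype(float)   # LV width of each horizontal line, AP2
    n = min(len(a_px), len(b_px))
    return float(np.pi / 4.0 * np.sum(a_px[:n] * b_px[:n]) / r_px_per_cm ** 3)
```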

The geometric approximation assumptions of the single-plane (monoplane) and biplane area-length techniques are fairly similar. Monoplane EF estimation can be used in cases where only one of the AP2 or AP4 views is available. Grossgasteiger et al. [13] compared the accuracy and feasibility of six commonly used 2D methods to assess LV function: the biplane Simpson’s method has the strongest correlation with 3D echo LVEF, followed by the AP4 and AP2 monoplane Simpson’s methods, respectively.

Left ventricle segmentation and landmark detection

In this section, we discuss the details of the core intelligence of the mobile application, i.e., the LV segmentation and landmark detection method. We propose a multi-task deep learning approach to simultaneously segment the LV and detect the two LV landmarks. This method consists of a segmentation network (S) and a critic network (C), shown in Fig. 3. The segmentation model estimates the LV region and the two landmarks (LMs). The critic network is used during training in an adversarial framework to improve the segmentation output.

Fig. 3

Architecture of the adapted deep fully convolutional network for simultaneous LV segmentation and LV landmark detection. Critic model (C) is only used in the training phase. The trained network parameters of the multi-task segmentation network (S) are frozen and deployed on the mobile application

Segmentation model

We implemented a network based on the U-net [29] architecture as our segmentation network. The U-net is a fully convolutional segmentation model consisting of a down-sampling feature-extraction path, an up-sampling reconstruction path, and skip connections between the down-sampling and up-sampling blocks that share the same output feature size. Our U-net implementation is modified by adding two branches to its last up-sampling layer. One branch of the multi-task segmentation network predicts the LV segmentation, and the other branch detects the location of the two LV landmarks, i.e., the LV apex and the middle of the mitral valve, in both the AP2 and AP4 views. We denote \(S_{\mathrm {LV}}(f; \theta _s)\) and \(S_{\mathrm {LM}}(f; \theta _s)\) as the functions that estimate the LV region and the two LMs from the input frame f. The LV region and the locations of the two landmark points are used in Eq. 1 for biplane EF estimation.

To train the segmentation network, we use a Dice loss \(\mathcal {L}_{\mathrm {LV}}\) to compare the LV prediction of the network with the ground truth p. A weighted binary cross-entropy \(\mathcal {L}_{\mathrm {LM}}\) is used as the loss function for the network’s landmark detection. Detection of the centroids of the landmarks is formulated as a segmentation problem. This results in a highly unbalanced dataset, i.e., there are only two points in the landmark class, while all other pixels of the image belong to the background class. To rectify this imbalanced class distribution, two solutions are applied. First, a circle with radius R is defined around each landmark point in the training samples. At test time, the centers of mass of the predicted connected components are used as the locations of the landmark points. Second, a class-weighting approach is applied to the cross-entropy loss according to the number of samples in the landmark and background classes, in order to balance their populations during training, i.e., a higher weight is given to the under-represented landmark class. In our method, a weight of \(W_c = \frac{T}{2 T_c}\) is given to the class c, where \(c \in \{landmark, background\}\), T is the total number of pixels in a training sample, and \(T_c\) denotes the number of pixels in class c.
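A minimal sketch of these two loss terms, written with the Keras backend API, is shown below; the exact reductions and numerical safeguards are our assumptions.

```python
# Sketch of the Dice loss (L_LV) and the class-weighted cross-entropy (L_LM)
# with weights W_c = T / (2 * T_c), computed per training sample.
from tensorflow.keras import backend as K

def dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Dice loss between the LV ground truth and the LV prediction."""
    inter = K.sum(y_true * y_pred)
    return 1.0 - (2.0 * inter + eps) / (K.sum(y_true) + K.sum(y_pred) + eps)

def weighted_bce_loss(y_true, y_pred, eps=1e-6):
    """Binary cross-entropy weighted by the landmark/background class populations."""
    t = K.cast(K.prod(K.shape(y_true)), K.floatx())   # total number of pixels T
    t_pos = K.sum(y_true)                              # landmark pixels T_landmark
    w_pos = t / (2.0 * t_pos + eps)                    # weight for the landmark class
    w_neg = t / (2.0 * (t - t_pos) + eps)              # weight for the background class
    y_pred = K.clip(y_pred, eps, 1.0 - eps)
    bce = -(w_pos * y_true * K.log(y_pred) + w_neg * (1.0 - y_true) * K.log(1.0 - y_pred))
    return K.mean(bce)
```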

Critic model

The outputs of the multi-task segmentation network S are then fed to a critic network C: the predicted LV region and landmark locations are summed element-wise, re-normalized to the range [0, 1], and passed to the critic network.

The critic network is a CNN that tries to discriminate whether an annotation was produced by the cardiologist (called True) or by the segmentation network (called Fake), i.e., \(y = C(m; \theta _c)\), where \(y \in \{True, Fake\}\) and m represents an annotation map showing the LV region and landmarks. Trained with a binary cross-entropy loss (\(\mathcal {L}_C\)), the critic network learns to discriminate the distribution of the ground truth annotations from that of the outputs of the segmentation model. The critic encourages the predictions of the segmentation network to converge toward the distribution of True masks, i.e., the segmentation network produces results that are indistinguishable from the annotations made by the cardiologist. In this way, a higher-order shape-wise constraint, which can be difficult to express in a standard per-pixel loss function [19], is imposed on the segmentation network’s output. The critic model can verify the shape integrity of the predicted LV masks and the localization accuracy of the LV landmarks.

Adversarial training

Given the set of predictions \(\{S_{\mathrm {LV}}(f;\theta _s),S_{\mathrm {LM}}(f;\theta _s)\}\) and \(C(m; \theta _c)\), the segmentation model is trained to minimize:

$$\begin{aligned} \mathcal {L}(\theta _s)= & {} \lambda _1 \mathcal {L}_{\mathrm {LV}}(S_{\mathrm {LV}}(f;\theta _s),p) + \lambda _2 \mathcal {L}_{\mathrm {LM}}(S_{\mathrm {LM}}(f;\theta _s),q) \nonumber \\&+\, \lambda _3 \mathcal {L}_C\big ( C(m;\theta _c),True\big ), \end{aligned}$$
(6)

where p and q are the respective ground truths for the LV segmentation and LV landmark locations; \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are the weighting parameters of the respective loss terms; \(m = \mathrm {Merge}\big (S_{\mathrm {LV}}(f;\theta _s), S_{\mathrm {LM}}(f;\theta _s)\big )\) sums and re-normalizes \(S_{\mathrm {LV}}(f;\theta _s)\) and \(S_{\mathrm {LM}}(f;\theta _s)\); and \(\mathcal {L}_C\) encourages S to produce segmentation maps that fool the discriminator C into recognizing the maps as True. Throughout the learning phase, the segmentation network and the critic network are alternately trained in an adversarial framework. In each learning iteration, the segmentation network is trained to minimize Eq. 6 while the model parameters of the critic network, \(\theta _c\), are kept unchanged. The critic, in turn, is trained with \(\mathcal {L}(\theta _c) = \mathcal {L}_C\big ( C(\mathrm {Merge}(p, q);\theta _c),True\big ) + \mathcal {L}_C\big ( C(m;\theta _c),False\big )\) to classify between the distribution of the ground truth annotations and the distribution of the masks predicted by S. This pushes the segmentation model S toward generating masks that are similar to the cardiologist’s annotations, and hence an implicit shape prior is enforced on the joint space of the predicted LV segmentation and landmark locations.
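The following sketch illustrates this alternating scheme in TensorFlow/Keras, reusing the loss functions sketched earlier; `seg_model` and `critic` stand for the S and C networks built elsewhere, and the merge operation follows the re-normalization described above. This is a schematic of the training loop under our assumptions, not the authors' implementation.

```python
# Sketch of one alternating adversarial training step for Eq. (6); seg_model
# maps a frame batch to (lv_pred, lm_pred), and critic maps a merged annotation
# map to the probability of being a cardiologist (True) annotation.
import tensorflow as tf

lam1, lam2, lam3 = 1.0, 1.0, 0.1
opt_s, opt_c = tf.keras.optimizers.Adam(), tf.keras.optimizers.Adam()
bce = tf.keras.losses.BinaryCrossentropy()

def merge(lv, lm):
    m = lv + lm
    return m / (tf.reduce_max(m) + 1e-6)                        # re-normalize to [0, 1]

def train_step(f, p, q):
    # 1) update S while the critic's weights theta_c stay frozen
    with tf.GradientTape() as tape:
        lv, lm = seg_model(f, training=True)
        y_fake = critic(merge(lv, lm), training=False)
        loss_s = (lam1 * dice_loss(p, lv)
                  + lam2 * weighted_bce_loss(q, lm)
                  + lam3 * bce(tf.ones_like(y_fake), y_fake))   # fool C into predicting True
    grads = tape.gradient(loss_s, seg_model.trainable_variables)
    opt_s.apply_gradients(zip(grads, seg_model.trainable_variables))
    # 2) update C to separate ground truth (True) from predictions (Fake)
    with tf.GradientTape() as tape:
        lv, lm = seg_model(f, training=False)
        y_real = critic(merge(p, q), training=True)
        y_fake = critic(merge(lv, lm), training=True)
        loss_c = bce(tf.ones_like(y_real), y_real) + bce(tf.zeros_like(y_fake), y_fake)
    grads = tape.gradient(loss_c, critic.trainable_variables)
    opt_c.apply_gradients(zip(grads, critic.trainable_variables))
```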

Network’s architecture

The multi-task segmentation model S is based on the U-net model. S has four down-sampling and four up-sampling steps with concatenating skip connections. All max-pooling layers have a size of \(2\times 2\). All convolutional layers have a kernel size of \(3\times 3\) with a stride of one, followed by a batch normalization layer and a ReLU activation function. The activation function in the last layer is a sigmoid. The base number of filters is set to four and is doubled after each down-sampling step, resulting in a small, lightweight network with about 123k trainable parameters, suitable for running smoothly on a mobile device.
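A compact Keras sketch of this architecture is given below; details that the text leaves open, such as the up-sampling operator and the pooling stride, are our assumptions.

```python
# Sketch of the lightweight multi-task U-net: four down- and four up-sampling
# steps, a base of four filters doubled at each step, and two sigmoid branches
# (LV mask and landmark mask) attached to the last up-sampling layer.
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return x

def build_segmentation_model(input_shape=(128, 128, 1), base=4):
    inp = layers.Input(input_shape)
    skips, x = [], inp
    for d in range(4):                                    # down-sampling path
        x = conv_block(x, base * 2 ** d)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, base * 2 ** 4)                      # bottleneck
    for d in reversed(range(4)):                          # up-sampling path
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skips[d]])           # concatenating skip connection
        x = conv_block(x, base * 2 ** d)
    lv = layers.Conv2D(1, 1, activation="sigmoid", name="lv")(x)         # LV segmentation branch
    lm = layers.Conv2D(1, 1, activation="sigmoid", name="landmarks")(x)  # landmark branch
    return Model(inp, [lv, lm])
```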

The critic network C is a CNN with three convolutional layers followed by two fully connected layers. The outputs of the first two convolutional layers are down-sampled using average pooling. The convolutional kernels in C have a size of \(3\times 3\) with a stride of one, and the pooling layers have a size of \(2\times 2\). The number of filters in the first convolutional layer is set to 16 and is doubled after each down-sampling. The network is terminated with a two-layer fully connected head with 64 and 1 neurons, respectively, the latter of which outputs the True or Fake classification. All intermediate layers in C are followed by batch normalization, Leaky ReLU, and dropout with a ratio of 0.25. The activation function in the last layer is a sigmoid.
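A matching Keras sketch of the critic is shown below; the ordering of batch normalization, activation, and dropout inside each block is an assumption.

```python
# Sketch of the critic C: three 3x3 convolution layers (16 filters, doubled
# after each of the two average-pooling steps) and a 64-to-1 dense head.
from tensorflow.keras import layers, Model

def build_critic(input_shape=(128, 128, 1)):
    inp = layers.Input(input_shape)
    x = inp
    for i, filters in enumerate([16, 32, 64]):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        x = layers.Dropout(0.25)(x)
        if i < 2:                                   # only the first two conv layers are pooled
            x = layers.AveragePooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(0.25)(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # True (expert) vs. Fake (network) score
    return Model(inp, out)
```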

Experiments

Dataset and implementation details

The proposed application is evaluated on a dataset of 854 echo studies, collected from the Picture Archiving and Communication System at Vancouver General Hospital, with ethics approval from the Clinical Medical Research Ethics Board, in consultation with the Information Privacy Office. The data includes pairs of AP2 and AP4 echo views from 427 patients. For all echo studies, the LV segmentation and landmark locations are annotated by an expert cardiologist at the ED, ES, and a random middle frame between the ED and ES phases. The cardiologist’s annotations are regarded as the ground truth. For each patient, the ground truth LVEF is computed using the cardiologist’s annotations.

Echo cines are loaded onto the mobile application to obtain the AP2 and AP4 LV segmentations, landmarks, and the biplane ejection fraction. The dataset is randomly split into five non-overlapping groups based on the patients. To obtain results on the entire dataset, the experiment is run five times, where in each run one group is set aside as the unseen test set and training is done with the other four groups. Therefore, the training-to-test ratio is 80% to 20%. Also, in each run, 10% of the training data is used as a validation set to search for the optimal hyper-parameters.
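This patient-wise split can be sketched with scikit-learn's GroupKFold, which guarantees that a patient's AP2 and AP4 studies never appear in both the training and test folds; the array construction below is illustrative.

```python
# Sketch of the patient-wise five-fold split: grouping by patient ID keeps both
# views of a patient on the same side of the train/test boundary.
import numpy as np
from sklearn.model_selection import GroupKFold

studies = np.arange(854)                      # one entry per echo study
patient_ids = np.repeat(np.arange(427), 2)    # an AP2 and an AP4 study per patient

for fold, (train_idx, test_idx) in enumerate(
        GroupKFold(n_splits=5).split(studies, groups=patient_ids)):
    # a further 10% of the training studies would be held out for validation
    print(f"fold {fold}: {len(train_idx)} train studies, {len(test_idx)} test studies")
```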

The network is implemented in Keras with a TensorFlow backend and trained on a PC. The weights of the network are then frozen and transferred to the mobile application. The mobile device used is a Samsung S8+ with 6 GB of RAM, running a Snapdragon octa-core processor (\(4\times 2.45\) GHz and \(4\times 1.9\) GHz CPUs). The Adam optimizer is used to train the network. \(\lambda _1\) to \(\lambda _3\) in Eq. 6 are set to 1, 1, and 0.1, respectively. The circles around the landmark points in training have a radius of \(R=7\) pixels in echo images of size \(128\times 128\). Two separate networks with similar architectures are trained for the AP2 and AP4 views. The \(\lambda \) values and R are the hyper-parameters of the model, optimized using the validation set. The network’s training is done on the ED, ES, and a random middle frame (RF) of the echo cines, where ground truth by the cardiologist is available.

Table 2 Evaluation of LV segmentation performance in AP2 and AP4 echo views
Table 3 Evaluation of LV landmark detection in AP2 and AP4 views
Table 4 Evaluation of LVEF estimation in AP2 single view, AP4 single view, and AP2–AP4 biplane views

Quantitative evaluation

Here, we evaluate the results at each step of the proposed pipeline. The steps include AP2 LV segmentation, AP2 LV landmark detection, AP4 LV segmentation, AP4 LV landmark detection, and finally using the segmentation masks and landmarks to obtain a biplane LVEF estimation.

U-net is considered a standard state-of-the-art model for medical image segmentation tasks, and the works in [35] and [33] propose variations of U-net for echocardiogram segmentation. We compare the performance of the U-net with and without the proposed training method. The applied U-net has four base filters and about 122k trainable parameters. The per-frame run time of the mobile framework with respect to the size of the network and the justification for the choice of four base filters are discussed in Sect. 2.1. In Table 2, we also compare our method to another widely used segmentation model, namely DeconvNet [25], with four base filters and 149k trainable parameters. The comparison is done on the ED, ES, and a random frame from the systole or diastole phase of the heart. The random middle frame gives an estimate of the segmentation performance over the whole cine clip. The cardiologist’s segmentation masks are used as the ground truth. Results of LV segmentation in the AP2 and AP4 echo views are presented in Table 2. Our proposed multi-task network also automatically detects the locations of the LV landmarks. The Euclidean distance between the landmark points detected by our method and the cardiologist’s annotations is shown in Table 3. The distance is presented in pixel (px) space for echo images of size \(128\times 128\). The LV segmentation and landmark points are used in the pipeline to automatically calculate the biplane LV ejection fraction. LVEF estimation errors are presented in Table 4. Evaluated over 427 patients, the biplane LVEF percentage automatically estimated by our method has a median absolute error of 6.2% and a mean absolute error of 7.8% compared to the cardiologist’s opinion. The significance of these results is further highlighted by the reported inter-observer variability of 17.8% and intra-observer variability of 13.4% for echo biplane LVEF estimation in clinical examinations [8].

Conclusion and discussion

In this paper, we presented a pipeline using mobile POCUS for biplane LVEF estimation. We proposed a lightweight multi-task segmentation framework, based on fully convolutional networks and adversarial training, for simultaneous LV segmentation and LV landmark detection. The software, evaluated on pairs of AP2 and AP4 echocardiograms from 427 patients, reaches a high correlation with the cardiologist’s assessments. The presented experiments show a mean Dice score of 92% for LV segmentation, superior to existing comparable methods, and a mean Euclidean distance of 2.85 pixels for LV landmark detection. The predicted annotation set is used in the proposed pipeline to calculate the biplane LVEF. The biplane LVEF automatically estimated by the proposed method has a mean absolute error of less than 8% compared to the cardiologist’s estimations.

Prognostic and therapeutic cardiac decisions are often based on the LVEF measurement, and LVEF estimation is one of the key cardiac measurements derived from echo studies. Manual quantification of LVEF requires a cardiologist or sonographer to trace the LV, which is time-consuming and labor-intensive, with relatively high inter-observer and intra-observer variability. In recent years, ultrasound imaging has become widely accessible due to advances in the development of inexpensive portable POCUS devices. This paper is a step toward automatic LVEF estimation on readily available Android mobile devices compatible with POCUS. Mobile cardiac POCUS has the advantages of portability, low cost, accessibility, and immediacy of results, which are vital in applications such as emergency scenarios and anesthesia management [14].

Fig. 4

Sample LV segmentation by our method compared to manual annotation by the cardiologist

Fig. 5

Sample case of segmentation failure

Sample visual results of the LV segmentation by the proposed method compared to the manual annotation by the cardiologist are presented in Fig. 4, where (a) is an AP2 view and (b) and (c) are AP4 views. Figure 5 presents a sample failed case. Low quality of the captured echo, foreshortening, and fuzzy borders are the main reasons for such failures. The LVEF estimation directly depends on the accuracy of the LV segmentation. The ratio of the maximum to the minimum LV volume in a heart cycle is used to estimate LVEF. A segmentation error might exclude a part of the LV or, on the contrary, label the surrounding muscle tissue of the ventricle as part of the LV. In either case, the error directly affects the minimum or maximum measured LV volume and, in turn, causes an LVEF estimation error. This source of error leads to an observed bias toward overestimation of the LVEF in our current result set. Investigating machine learning solutions to guide the operator through the acquisition of accurate, high-quality echo views is a possible improvement that we consider as future work. Future work also includes the extension of the proposed multi-task segmentation model to other echo views and multiple heart chambers. The model could be extended to derive various clinically demanded cardiac metrics, such as LV wall motion abnormalities.