Introduction

Ultrasound (US) imaging is a low-cost, real-time, and non-ionizing modality, which makes it an appealing choice for image-guided computer-assisted interventions in radiation therapy. Treatments of liver tumors using high-intensity focused ultrasound, intensity-modulated radiation therapy, or proton therapy enable precise dose delivery to the desired location. However, the target region is affected during treatment by internal body motion, such as breathing, which is a major impediment to the effectiveness of these treatments. Not taking respiratory motion into account would cause the delivered dose distribution to deviate from the intended one, increasing the radiation exposure of healthy tissue while lowering the dose to the target volume, thereby reducing treatment efficacy and aggravating complications [1].

One strategy to reduce breathing-induced organ motion during radiation treatment is the deep-inspiration breath-hold method [2], where a patient performs a supervised breath hold during therapy; this requires active support and the ability of the patient to maintain such a breath hold. Another approach to compensate for breathing motion, which does not require patient compliance, is to track the position of the target region during therapy and dynamically adjust the radiation accordingly.

To use motion tracking algorithms in radiation therapy interventions, real-time, accurate, and robust localization of the target region is required throughout the entire procedure. US imaging, being non-ionizing and real-time, is an ideal choice for this aim [3]. Numerous studies have focused on tracking liver motion in US image sequences using different approaches, such as image registration [4], block matching [5], and optic flow [6]. However, these methods are generally affected by limitations of US imaging, such as low signal-to-noise ratio (SNR) and large appearance changes of the tracked landmarks caused by, e.g., acoustic shadowing due to poor transducer–skin contact or highly reflecting anatomical structures like the ribs.

In this work, we propose to use supporters, a computer vision technique [7], to improve optic flow-based tracking. The technique relies on tracking additional image features that are potentially useful for predicting the target position. To that end, a supporter model is built based on the motion coupling observed over previous frames between these tracked features (supporters) and the target. Using this model, tracking can be made robust to changes in target appearance, where a consensus vote of several supporter estimations is used to infer the target location.

Considering motion tracking in medical images, supporters were used earlier for determining two orthogonal MR acquisition planes through the heart valve [7]. Instead of the valve itself, which may leave the image, four annotated points (supporters) on a plane perpendicular to the valve were tracked to define the acquisition planes. A supporter model based on squared Euclidean distances was used to downweight distant supporters. In [8], supporters were used for tracking abnormalities in video capsule endoscopy. First, the supporters were matched between successive frames under a triangular constraint, where the triangle shape is maintained while allowing weak deformations. Then, affine transformations calculated from the supporter triplets help localize the abnormality, whose precise position is estimated from the features of the target itself. In [9], cells were tracked in spatiotemporal optical images of densely packed multilayer tissues. The tight spatial topology of neighboring cells was exploited as contextual information by applying spatiotemporal graph labeling. In [10], 600 supporters were detected in fluoroscopy images using the Kanade–Lucas–Tomasi feature tracker for automatic motion compensation. An autoregression model and motion clustering were employed to learn the relationship between supporter and target motion. Supporters were also used in many other typical computer vision applications, e.g., in [11,12,13,14,15]. Supporters have not previously been studied for motion tracking in US images. We hereby show that this method is particularly beneficial in cases where the target cannot be observed directly, for instance due to occlusion by shadowing artifacts.

Particular challenges of US tracking are poor image quality and the relatively small number of landmarks suitable for tracking. Nevertheless, the relative locations of liver landmarks remain stable during radiation therapy of liver tumors, which motivates the use of supporters in this work for 2D US tracking of the liver. We hereby devise an approach for effective supporter model creation from few supporters and evaluate it on a standard public dataset.

Methods

Motion tracking is the process of estimating the trajectory of an object over time by predicting its position in every frame of an image sequence. For image-guided computer-assisted applications, targets in moving organs such as the liver, prostate, and heart are commonly tracked. Tracking an object's position can be challenging, e.g., due to appearance changes over time, low SNR, or occlusions. In US images, the tracked target can temporarily disappear by leaving the field of view or by being covered by a shadow due to poor transducer–skin contact or highly reflecting anatomical structures such as the ribs. To improve the robustness of a conventional tracking algorithm in such cases, we propose combining it with a supporter model, which takes advantage of correlated surrounding motion.

Tracking with a supporter model

Grabner et al. [7] proposed a method for "tracking the invisible" using a set of local image features, called supporters, exploiting the visual context and relative spatial relations to improve target tracking. Good supporters were defined as image features whose motion is correlated with that of the target and which might thus be useful for predicting the target's position. For example, a wristwatch on a hand holding a target object is a good supporter for the position of that target (even when the target is not directly visible or trackable), since their motions are strongly correlated. Below, we first summarize the supporter model [7] for the sake of completeness and then describe our methods for its adaptation in this work.

Overview of supporter modeling Tracking with supporters has two main modes: learning the model and applying the model. The model captures the statistical relationship between the target and supporter positions, thereby providing a measure of how strongly the motion of each supporter is coupled with that of the target. This measure is then used to adjust the contribution of each supporter to the overall supporter prediction.

The overall goal is to learn and apply a probability density function (pdf) model, \(P({\mathbf {x}}|{\mathbf {I}})\), for predicting the position of the target object, \({\mathbf {x}}=(x,y)\), in image \({\mathbf {I}}\) with the help of S tracked supporter positions \(\{{\mathbf {x}}_{s}|s=1,2, \ldots ,S\}\). For this aim, the relationship between the supporter positions \(\{{\mathbf {x}}_{s}\}\) and the target position \({\mathbf {x}}\) is learned, providing a conditional pdf \(P({\mathbf {x}}|{\mathbf {x}}_{s})\) for each supporter s. Each supporter s then votes for potential target positions \({\mathbf {x}}\) via the pdf \(P({\mathbf {x}}|{\mathbf {x}}_{s})\). These votes are combined, accounting for the reliability of the supporter position estimates \({\mathbf {x}}_{s}\) in \({\mathbf {I}}\) with probability \(P({\mathbf {x}}_{s}|{\mathbf {I}})\), into the following pdf via the law of total probability:

$$\begin{aligned} P({\mathbf {x}}|{\mathbf {I}}) \propto \sum _{s=1}^{S} P({\mathbf {x}}|{\mathbf {x}}_{s})P({\mathbf {x}}_{s}|{\mathbf {I}}). \end{aligned}$$
(1)

The final target position is then determined by finding the position that has the highest likelihood in the voting space.
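
As a minimal illustration of this voting scheme, the following Python sketch combines per-supporter vote maps rasterized on a common image grid and takes the argmax; note that the concrete instantiation used later in this work replaces the grid argmax with a weighted mean, and all names here are illustrative:

```python
import numpy as np

def combine_votes(vote_pdfs, reliabilities):
    """Eq. (1): accumulate reliability-weighted supporter votes and
    return the grid position with the highest likelihood.

    vote_pdfs:     list of 2D arrays; vote_pdfs[s][y, x] ~ P(x | x_s)
    reliabilities: list of scalars;  reliabilities[s]   ~ P(x_s | I)
    """
    posterior = sum(r * pdf for pdf, r in zip(vote_pdfs, reliabilities))
    y, x = np.unravel_index(np.argmax(posterior), posterior.shape)
    return np.array([x, y])  # most likely target position
```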

Learning a supporter model Let \({\mathbf {I}}^0, {\mathbf {I}}^1, \ldots ,{\mathbf {I}}^{F-1}\) be an image sequence consisting of F image frames, \(\{{\mathbf {x}}_{s}^{0}|s=1,2,\ldots ,S\}\) be the set of S supporter positions in the first frame \({\mathbf {I}}^0\), and \({\mathbf {x}}^{0}\) be the target position in \({\mathbf {I}}^0\). The goal of the model is to estimate, for frame \({\mathbf {I}}^f\), the most likely target position \({\mathbf {x}}^f\) from the observed supporter positions \(\{{\mathbf {x}}_{s}^f\}\). Assuming a translational relationship, this is based on learning, per supporter s, the conditional pdf of the relative target position \(\mathbf {u}_{s}={\mathbf {x}}-{\mathbf {x}}_{s}\) for a given \({\mathbf {x}}_{s}\). For on-line learning during tracking, the exponential forgetting principle is applied between the pdf model learned so far, \(P^{f-1}(\cdot )\), and the current pdf \(p(\cdot )\):

$$\begin{aligned}&P^{f}(\mathbf {u}_{s}|{\mathbf {x}}_{s}) = \alpha \ P^{f-1}(\mathbf {u}_{s}|{\mathbf {x}}_{s}) + (1 - \alpha ) \ p(\mathbf {u}^{f}_{s}|{\mathbf {x}}_{s}^f), \end{aligned}$$
(2)
$$\begin{aligned}&P^{f}({\mathbf {x}}_{s}|{\mathbf {I}}) = \alpha \ P^{f-1}({\mathbf {x}}_{s}|{\mathbf {I}}) + (1 - \alpha ) \ p({\mathbf {x}}_{s}^f|{\mathbf {I}}^{f}) , \end{aligned}$$
(3)

where the forgetting factor \(\alpha \in [0,1]\) weights the contributions of the past and current pdfs. \(P^{f}(\mathbf {u}_{s}|{\mathbf {x}}_{s})\) is the model learned from frames 1 to f and provides the pdf of supporter position \({\mathbf {x}}_{s}\) voting for the relative target position \(\mathbf {u}_{s}\). \(p(\mathbf {u}^{f}_{s}|{\mathbf {x}}_{s}^f)\) is the corresponding pdf derived only from the tracked positions in the current frame f. \(P^{f}({\mathbf {x}}_{s}|{\mathbf {I}})\) is the reliability model of the supporter position estimation learned from frames 1 to f, and \(p({\mathbf {x}}_{s}^f|{\mathbf {I}}^{f})\) defines the reliability of the supporter position \({\mathbf {x}}_{s}^f\). We explain how \(P^{f}(\cdot )\) and \(p(\cdot )\) are defined in practice in "Robust motion tracking by estimating the target position using supporters" section.
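
Both updates share the same exponentially weighted moving-average form; a one-line sketch (illustrative name):

```python
def exp_forget(model_prev, obs_cur, alpha):
    """Exponential forgetting (Eqs. 2-3): blend the model learned up to
    frame f-1 with the observation from frame f. A larger alpha gives
    the past a longer memory; alpha must lie in [0, 1]."""
    return alpha * model_prev + (1.0 - alpha) * obs_cur
```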

Applying the supporter model Given image \({\mathbf {I}}^f\) and the tracked supporter positions \(\{{\mathbf {x}}_{s}^f\}\), the learned supporter models \(P^{f}(\mathbf {u}_{s}|{\mathbf {x}}_s)\) and \(P^{f}({\mathbf {x}}_{s}|{\mathbf {I}})\) are evaluated for \({\mathbf {x}}_{s}={\mathbf {x}}_{s}^{f}\) and \({\mathbf {I}}={\mathbf {I}}^f\). From this, the target position \({\mathbf {x}}^f\) is estimated by using Eqs. (2) and (3) in Eq. (1), where the pdfs of the relative target positions are brought into the target space via \(P^{f}({\mathbf {x}}= \mathbf {u}_{s}+{\mathbf {x}}_{s}^f|{\mathbf {x}}_{s}^f) = P^{f}(\mathbf {u}_{s}|{\mathbf {x}}_{s}^f)\), i.e.,

$$\begin{aligned}&{\mathbf {x}}^f=\arg \max _{{\mathbf {x}}} P({\mathbf {x}}|{\mathbf {I}}^f) \quad \text{ with } \quad \nonumber \\&\quad P({\mathbf {x}}|{\mathbf {I}}^f) = \sum _{s=1}^S P^{f}({\mathbf {x}}|{\mathbf {x}}_{s}^f)P^{f}({\mathbf {x}}_{s}^f|{\mathbf {I}}^{f}). \end{aligned}$$
(4)

Robust motion tracking by estimating the target position using supporters

Tracking with supporters requires another tracking method to compute the supporter locations and their reliability. Supporters can then assist and correct such a baseline method to achieve improved tracking results. We first summarize our method for a generic object tracker (see also Algorithm 1) and then instantiate it with a particular tracking method below.


Input data Our method uses a given initial target position \({\mathbf {x}}^{0}\), a fixed set of initial supporter positions \(\{{\mathbf {x}}^{0}_{s}\}\), and reference patches around the target, \(\mathbf {B}^{0}\), and around each supporter, \(\{\mathbf {B}^{0}_{s}\}\), where the positions and reference patches are manually annotated in the first image frame \({\mathbf {I}}^0\). Note that the reference patches are manually chosen to contain image appearance that is distinct from their surroundings. For the current frame \(f>0\), we obtain target and supporter position estimates from the conventional object tracker, denoted as \({\mathbf {x}}^{f}_{t}\) and \(\{{\mathbf {x}}^{f}_{s}\}\), respectively.

Tracking reliability Assuming that the feature appearance changes only linearly during tracking, we use the correlation coefficient between image patches to estimate tracking reliability. For this, we extract patches \(\mathbf {B}^{f}\) and \(\mathbf {B}^{f}_{s}\), of the same size as \(\mathbf {B}^{0}\) and \(\mathbf {B}^{0}_{s}\), centered around the tracked positions \({\mathbf {x}}^{f}_{t}\) and \({\mathbf {x}}^{f}_{s}\), respectively. Then, we calculate the correlation coefficient between the corresponding patches, i.e., \(\rho ^f = CC(\mathbf {B}^{0},\mathbf {B}^{f})\) and \(\rho ^f_s = CC(\mathbf {B}^{0}_{s},\mathbf {B}^{f}_{s})\). We employ the reliability measure \(\rho ^f\) to decide whether to rely on the current target position for tracking and for updating the model. Specifically, if \(\rho ^{f} \ge \theta _{CC}\), a learned threshold, we assume reliable object tracking and use this position, i.e., \({\mathbf {x}}^{f} = {\mathbf {x}}^{f}_{t}\). Furthermore, for another threshold \( \theta _{\textit{update}} > \theta _{CC}\), if \(\rho ^{f} \ge \theta _{\textit{update}}\), then the supporter model is updated as described next.
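
A minimal sketch of this reliability computation (Pearson correlation coefficient between the reference patch and the patch at the tracked position); helper names and the patch size are illustrative:

```python
import numpy as np

def extract_patch(img, center, half):
    """Patch of size (2*half+1)^2 centered at (x, y); assumes the
    patch lies fully inside the image."""
    x, y = int(round(center[0])), int(round(center[1]))
    return img[y - half:y + half + 1, x - half:x + half + 1]

def cc(patch_a, patch_b):
    """Pearson correlation coefficient between two equal-size patches."""
    a = patch_a.astype(float).ravel() - patch_a.mean()
    b = patch_b.astype(float).ravel() - patch_b.mean()
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Usage: rho_f = cc(B0, extract_patch(I_f, x_t_f, half=15))
#        reliable = rho_f >= theta_cc
```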

Fig. 1 (a) Illustration of a supporter voting for a target position (arrow), with a probability distribution (image intensities) defined by mean \(\varvec{\mu }\) and covariance \(\mathbf {C}_{s}\). (b) Illustration of a 1D Gaussian mixture model (red) formed from two individual distributions (green and blue), with mean values indicated by vertical lines

Fig. 2 Example of tracker and supporter predictions: target position from the main object tracker \({\mathbf {x}}^{f}_{t}\) (green), individual supporter predictions \(\varvec{\mu }^{f}_{s}+{\mathbf {x}}^{f}_{s}\) (blue), and the weighted mean from the Gaussian mixture model \({\mathbf {x}}^{f}_{p}\) (red), overlaid on (a) a US image and (b) the log-transformed probability density

Supporter model learning The supporter model \(P^f(\mathbf {u}_{s}|{\mathbf {x}}_{s}^f)\) from Eq. (2) is approximated with a 2D Gaussian distribution by

$$\begin{aligned} P^f(\mathbf {u}_{s}|{\mathbf {x}}_{s}^f) \propto \frac{1}{2 \pi \sqrt{ |\mathbf {C}_{s}^f|}} \exp \left( -\frac{1}{2} (\mathbf {u}_{s} - \varvec{\mu }_{s}^f)(\mathbf {C}_{s}^f)^{-1}(\mathbf {u}_{s} - \varvec{\mu }_{s}^f)^\intercal \right) , \end{aligned}$$
(5)

where \(\varvec{\mu }_{s}^f\) and \(\mathbf {C}_{s}^f\) denote the on-line learned mean and covariance matrix, respectively, of the relative target positions \(\mathbf {u}^{f}_{s}\) across frames, i.e.,

$$\begin{aligned} \varvec{\mu }^{f}_{s}&= \alpha \ \varvec{\mu }^{f-1}_{s} + (1-\alpha ) \ \mathbf {u}^{f}_{s}, \end{aligned}$$
(6)
$$\begin{aligned} \mathbf {C}^{f}_{s}&= \alpha \ \mathbf {C}^{f-1}_{s} + (1-\alpha ) \ \mathbf {C}_{s}, \end{aligned}$$
(7)

where the covariance matrix \(\mathbf {C}_{s}\) captures the variance contribution of the current relative target position \(\mathbf {u}^{f}_{s} = [u^{f}_{s}, v^{f}_{s}]\) with respect to the current mean \(\varvec{\mu }^{f}_{s} =[\mu ^{f}_{s,u}, \mu ^{f}_{s,v}]\):

$$\begin{aligned} \mathbf {C}_{s} = \begin{bmatrix} \left( u^{f}_{s} - \mu ^{f}_{s,u}\right) ^{2}&0 \\ 0&\left( v^{f}_{s} - \mu ^{f}_{s,v}\right) ^{2} \end{bmatrix}. \end{aligned}$$
(8)

An illustration of such a distribution is shown in Fig. 1a.
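
A minimal sketch of this on-line update (Eqs. 6–8), assuming each supporter's state is kept as a mean vector and covariance matrix:

```python
import numpy as np

def update_gaussian(mu_prev, C_prev, u_cur, alpha):
    """On-line update of one supporter's Gaussian vote model.

    mu_prev: (2,) mean of the relative target position from frame f-1
    C_prev:  (2, 2) covariance from frame f-1
    u_cur:   (2,) current relative target position u_s^f = x - x_s
    """
    mu = alpha * mu_prev + (1.0 - alpha) * u_cur          # Eq. (6)
    d = u_cur - mu                                        # deviation from the updated mean
    C = alpha * C_prev + (1.0 - alpha) * np.diag(d * d)   # Eqs. (7)-(8)
    return mu, C
```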

Supporter model application We use the supporter model to predict the target position \({\mathbf {x}}^{f}\) if the tracked target position \({\mathbf {x}}_t^{f}\) is not reliable (i.e., \(\rho ^{f} < \theta _{CC}\)). The most likely relative target location per supporter s is mean \(\varvec{\mu }^{f}_{s} = \arg \max _{\mathbf {u}} P^f(\mathbf {u}|{\mathbf {x}}_{s}^f)\), with corresponding probability \(P^f(\varvec{\mu }^{f}_{s}|{\mathbf {x}}_{s}^f) = 1/(2 \pi \sqrt{|\mathbf {C}_{s}^f|})\). Instead of predicting the target from the peak of the resulting Gaussian mixture model (GMM) distribution (see Fig. 1b for a 1D illustration), we use a weighted average of the mean values from all mixture components [16] and incorporate the reliability of the supporter position predictions, i.e., \(P^f({\mathbf {x}}_{s}^f|{\mathbf {I}}^{f}) = \rho ^f_{s}\). The prediction from all supporters is then

$$\begin{aligned} {\mathbf {x}}^{f}_{p} &= \frac{\sum _{s} (\varvec{\mu }^{f}_{s}+{\mathbf {x}}^{f}_{s}) \, P^f(\varvec{\mu }^{f}_{s}|{\mathbf {x}}_{s}^f) \, P^f({\mathbf {x}}_{s}^f|{\mathbf {I}}^{f})}{\sum _{s} P^f(\varvec{\mu }^{f}_{s}|{\mathbf {x}}_{s}^f) \, P^f({\mathbf {x}}_{s}^f|{\mathbf {I}}^{f})} \nonumber \\ &= \frac{\sum _{s} (\varvec{\mu }^{f}_{s}+{\mathbf {x}}^{f}_{s}) \, \rho ^f_{s}/\sqrt{|\mathbf {C}_{s}^f|}}{\sum _{s} \rho ^f_{s}/\sqrt{|\mathbf {C}_{s}^f|}}. \end{aligned}$$
(9)

Finally, if the applied supporter model and the main object tracker agree on the target position estimate, i.e., \(P({\mathbf {x}}^{f}_{t}) = \sum _{s} P^f({\mathbf {x}}^{f}_{t}-{\mathbf {x}}^{f}_{s}|{\mathbf {x}}_{s}^{f}) \rho ^f_{s} \ge \theta _{P}\), then the estimate from the main tracker is used: \({\mathbf {x}}^{f} = {\mathbf {x}}^{f}_{t}\). Otherwise, we use the supporter prediction \({\mathbf {x}}^{f} = {\mathbf {x}}^{f}_{p}\).
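
A sketch of this prediction and agreement check (Eq. 9 and the threshold \(\theta _{P}\)), assuming per-supporter Gaussians as in Eq. (5); all names are illustrative:

```python
import numpy as np

def gauss_pdf(u, mu, C):
    """Evaluate the 2D Gaussian of Eq. (5) at relative position u."""
    d = u - mu
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(C)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(C) @ d)

def supporter_prediction(mus, Cs, xs, rhos):
    """Eq. (9): weighted mean of per-supporter predictions mu_s + x_s,
    weighted by peak height 1/sqrt(|C_s|) and reliability rho_s."""
    w = np.array([r / np.sqrt(np.linalg.det(C)) for C, r in zip(Cs, rhos)])
    preds = np.array([mu + x for mu, x in zip(mus, xs)])
    return (w[:, None] * preds).sum(axis=0) / w.sum()

def agrees_with_tracker(x_t, mus, Cs, xs, rhos, theta_p):
    """Agreement check: the reliability-weighted supporter likelihood
    of the tracker's estimate x_t must reach theta_p."""
    p = sum(r * gauss_pdf(x_t - x, mu, C)
            for mu, C, x, r in zip(mus, Cs, xs, rhos))
    return p >= theta_p
```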

An example for target position estimation using the supporter model is shown in Fig. 2.

Experiments and results

We evaluated our method using the 2D liver US image sequences provided by the Challenge on Liver Ultrasound Tracking (CLUST)-2015 [17]. A main advantage of supporters is robustness to changes in feature appearance during tracking, for instance, when the target is occluded by acoustic shadowing. Since such disappearing target locations are not (and cannot reliably be) annotated in the given dataset, we devised a simulation framework to emulate acoustic shadowing artifacts from the ribs in the images and evaluated this scenario. As the baseline object tracker, we employed [6], so that motion tracking with and without the supporter model can be compared.

CLUST-2015 dataset

The CLUST-2015 dataset includes 2D liver US image sequences and consists of two subsets, a training and a test set. The sequences have durations between 60 and 330 s. The training set contains 24 image sequences with manual annotations in 10% of all frames. The annotations are mostly of vessel cross sections in the liver, which are reliable landmarks for liver motion. The test set contains 24 image sequences with no public annotations apart from the reference positions \({\mathbf {x}}^0\); submitted results are evaluated by the challenge organizers. For the evaluation, the Euclidean distance between each manual annotation and the corresponding tracked point is computed, and summary error statistics including the mean, standard deviation, and 95%ile errors are reported to the participants. In this work, we are particularly interested in reducing the 95%ile error, i.e., minimizing large errors for robust tracking performance throughout all sequences.
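
For reference, a minimal sketch of this evaluation metric, assuming tracked and annotated positions are given as N x 2 arrays in mm:

```python
import numpy as np

def clust_error_stats(tracked, annotated):
    """Per-annotation Euclidean tracking errors and the summary
    statistics reported by the CLUST evaluation (in mm)."""
    err = np.linalg.norm(tracked - annotated, axis=1)
    return {"mean": err.mean(),
            "std": err.std(),
            "95%ile": np.percentile(err, 95),
            "max": err.max()}
```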

For parameter optimization and sensitivity analysis, we used the training set. Our method has four parameters to optimize: the forgetting factor \(\alpha \), the correlation coefficient threshold \(\theta _{CC}\), the supporter model update threshold \(\theta _{\textit{update}}\), and the target probability threshold \(\theta _{P}\). We optimized these parameters to minimize the 95%ile error using grid search with leave-one-out cross-validation; a sketch of this loop is given after this paragraph. The optimal parameters range from \([\alpha ,\theta _{CC},\theta _{\textit{update}},\theta _{P}] = [0.90,0.3,0.3,0.5]\) to [0.95, 0.3, 0.4, 0.7] and are hence relatively insensitive to the left-out case. The mean parameters were \([\alpha ,\theta _{CC},\theta _{\textit{update}},\theta _{P}] = [0.9479, 0.3000, 0.3021, 0.6625]\). Figure 3 shows the mean, 95%ile, and maximum tracking error distributions over the 24 sequences for the baseline method (abbreviated TMG, for Tracking by Makhinya and Goksel) and our proposed tracker (denoted RMTwS, for Robust Motion Tracking with Supporters). Table 1a compares the overall performance in terms of the mean, standard deviation, 95%ile, and maximum error after pooling all training results into one distribution. Note that our proposed method yields a 16% improvement in the 95%ile error.
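
A sketch of this optimization loop, assuming a hypothetical run_tracker(sequence, params) that returns the per-annotation errors for one sequence:

```python
import itertools
import numpy as np

def loo_grid_search(sequences, run_tracker, grid):
    """Leave-one-out grid search minimizing the pooled 95%ile error.
    grid: dict of parameter name -> candidate values, e.g.
          {"alpha": [0.90, 0.95], "theta_cc": [0.2, 0.3, 0.4], ...}"""
    per_fold_best = []
    for i in range(len(sequences)):
        train = sequences[:i] + sequences[i + 1:]   # leave sequence i out
        best = None
        for values in itertools.product(*grid.values()):
            params = dict(zip(grid.keys(), values))
            errs = np.concatenate([run_tracker(s, params) for s in train])
            score = np.percentile(errs, 95)
            if best is None or score < best[0]:
                best = (score, params)
        per_fold_best.append(best[1])
    return per_fold_best  # inspect the spread across folds for sensitivity
```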

Fig. 3 Tracking error distributions (in mm) for the baseline (TMG) and proposed method (RMTwS) over the 24 training sequences. (a) Mean tracking error. (b) 95%ile tracking error. (c) Maximum tracking error

Table 1 Comparison of the mean, standard deviation, 95%ile, and maximum tracking errors (in mm) of the baseline (TMG) and proposed (RMTwS) methods after pooling all results from the (a) training and (b) test sets

Fig. 4 Tracking error distributions (in mm) for the baseline (TMG) and proposed method (RMTwS) over the 24 test sequences. (a) Mean tracking error. (b) 95%ile tracking error. (c) Maximum tracking error

We then applied our method to the test set using the optimal parameters found above. The test set results were evaluated by the challenge organizers. Figure 4 compares the tracking error distributions of the baseline tracker, TMG, and our proposed tracker, RMTwS, over the 24 test sequences, and Table 1b lists the overall performance after pooling all results. RMTwS yields a 1.04 mm mean and 2.26 mm 95%ile error, improving on the baseline method by 4.6% and 6.6%, respectively. The 95%ile error of the individual test landmarks was improved by more than 5% for seven landmarks and by more than 30% for five landmarks. For the remaining landmarks, the errors changed by less than 2%.

Fig. 5 Shadow simulation example with (a) signal intensity map, (b) original image, (c) shadowed image

We also evaluated the runtime of our proposed method. Learning and applying the supporter model take between 20 and 60 ms per frame for the given sequences on an Intel Core i7-4770K CPU @ 3.5 GHz.

Evaluating tracking under shadowing

Since target points that disappear in acoustic shadow are not annotated in the CLUST-2015 dataset, we conducted a simulation in which we emulated acoustic shadowing artifacts from a simulated rib in the images and evaluated this scenario. For this purpose, we manually placed a structure of size 12.4 mm \(\times \) 7.2 mm, representing a rib cross section in accordance with [18], close to the skin.

We augmented each frame of a US image sequence from the training data with new ultrasound bone shadows by multiplying the input US images with a signal intensity map. For each pixel of an ultrasound image, this map stores the accumulated intensity of the ultrasound signal after reflection at the bone surface and energy loss (attenuation) within the bone structures. Its values lie in [0, 1], with 1 denoting the original signal intensity and 0 a complete signal loss. The signal intensity map is generated in a multistage process. In the first step, we create a map of attenuation coefficients \(\mathbf {Z}\) for the bone cross sections, given by the intersection of the bone tissue with the transducer plane. To create a bone segment j, we simply rasterize a circle with radius \(r_j\) at position \(p_j\) in \(\mathbf {Z}\). Inside each circle, we store the attenuation coefficient \(\mathbf {Z}(x,y)=\beta _j\) of bone segment j, and \(\mathbf {Z}(x,y)\) is zero otherwise. Typical values of \(\beta \) for bone are taken from the literature [19].

In the next step, we use ray marching to traverse \(\mathbf {Z}\) and create a (pre-scan-converted) signal intensity map \(\mathbf {A}\), in a simplified, task-specific variation of a more complex ultrasound simulation method [19]. In particular, we traverse the columns (scanlines) of \(\mathbf {Z}\) from top to bottom (y-direction). During this, we record a reflected signal intensity at the bone surface and the energy loss thereafter, accumulating the attenuation coefficients in \(\mathbf {Z}\). At each step of the ray marching, the current pixel \(\mathbf {A}(x,y)\) is computed as \(\mathbf {A}(x,y)=\mathbf {A}(x,y-1)\exp (-\mathbf {Z}(x,y))\).

The resulting signal intensity map is finally filtered with a Gaussian function to emulate the blurring due to convolution with the ultrasound point spread function (PSF). Since the input images are from a convex probe, the map is scan-converted from a radial domain into a Cartesian frame, using the scan conversion parameters estimated geometrically from the original image. This yields the typical ultrasound shadow appearance in convex probe images, where the shadows become softer and wider in the far field of the images. This provides simulated image data with ground truth for evaluating tracking under shadowing. Example images of a signal intensity map, an original image, and the resulting shadowed image are shown in Fig. 5.
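
A condensed sketch of this pipeline (attenuation map rasterization, column-wise ray marching, PSF blurring), omitting the surface reflection term and the convex-probe scan conversion for brevity; all parameter names are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_shadow(img, bones, sigma=2.0):
    """Multiply a (pre-scan-converted) US frame with a simulated signal
    intensity map A.  bones: list of (cx, cy, r, beta) circles giving
    bone cross-section position, radius, and attenuation coefficient."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    Z = np.zeros((h, w))                        # attenuation coefficient map
    for cx, cy, r, beta in bones:
        Z[(xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2] = beta
    A = np.ones((h, w))                         # signal intensity map in [0, 1]
    for y in range(1, h):                       # top-to-bottom ray marching
        A[y] = A[y - 1] * np.exp(-Z[y])         # A(x,y) = A(x,y-1) exp(-Z(x,y))
    A = gaussian_filter(A, sigma)               # emulate blurring by the PSF
    return img * A
```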

After generating a 2D US image sequence containing the shadow, we applied the baseline and our method to the new sequence, using the same optimal parameters as for the CLUST-2015 test set, obtained by leave-one-out cross-validation. The mean errors for TMG and RMTwS were 2.79 and 2.61 mm, with 95%ile errors of 12.11 and 10.29 mm, respectively. This indicates a 6.5% (15%) improvement in mean (95%ile) error. Examples of tracking performance with and without shadowing for the inhale and exhale phases of the breathing cycle are shown in Fig. 6.

Fig. 6 Example of tracking performance (a, c) without and (b, d) with shadowing, for (a, b) the inhale and (c, d) the exhale breathing phase, showing the improved robustness of the proposed method, RMTwS, in (d)

Discussion and conclusions

We have demonstrated an ultrasound tracking method using supporters, RMTwS, where image locations other than the target are also tracked in order to exploit motion consistency with the surrounding tissue and improve tracking robustness. We employed an optic flow- and vessel model-based tracker, TMG, as our baseline, as well as for tracking the target and supporter locations from which the supporter model is then learned and applied. We are particularly interested in reducing 95%ile errors to ensure effective tracking performance throughout all frames, in order to minimize 95%ile therapy margins for more focal therapies and reduced collateral damage to healthy tissue.

Our evaluations on the training and test sets show that the proposed method, RMTwS, tracks targets more accurately than the conventional object tracker, TMG. The resulting performance is a 1.04 mm mean and 2.26 mm 95%ile error. This 95%ile tracking performance is relevant for liver motion tracking in radiation and focused therapy applications, when compared to the 1.23 mm mean inter-observer 95%ile variability reported for a similar dataset in [17].

The accuracy improvements appear small for the mean and 95%ile errors when all trajectories are taken into account. This is because the main object tracker already performs quite well in most cases and fails only in certain situations, such as under shadowing. All the same, to enable satisfactory therapy for every patient, a tracking method should be robust in all scenarios.

The optimal thresholds for updating the supporter model, \(\theta _{\textit{update}}\), and for the reliability of the tracking performance, \(\theta _{CC}\), were found to be very close. A supplementary experiment showed that the difference in tracking performance when using \(\theta _{\textit{update}} = \theta _{CC}\) is insignificant. Thus, the same value can be used for \(\theta _{CC}\) and \(\theta _{\textit{update}}\).

Our proposed method applies the learned supporter model in 12% of the frames, indicating that the reliability of TMG tracking is not always high. The main advantage of using supporters for tracking is robustness to scene or target appearance changes over time, such as those due to acoustic shadowing. Since no annotations exist for such cases in the given dataset and this scenario cannot be evaluated in the current setting, we devised a simulation framework to imitate acoustic shadowing artifacts on the images of a 2D sequence. This simulated experiment showed that, without additional optimization for such a scenario, the proposed method improves the 95%ile tracking performance of the baseline by 15%.

Two to three supporters were used in each sequence, which is not a large number, since only a few easily identifiable landmarks exist in these images. We aim to study automatic landmark detection in the future to identify a (potentially larger) number of supporters automatically, also yielding an interaction-free framework. Additionally, with more supporters available, we plan to conduct a sensitivity analysis regarding their number and locations.

There are several locations in the liver whose motion can be used to infer the motion of others. However, this requires that the target and supporter motions are coherent and that a model exists to infer the former from the latter. Coherence is already checked during the supporter model building process. More complex models, such as finite element (FE) models, could also be considered; however, with a continuum mechanics-based approach and our current position priors, extrapolating to target points lying outside the FE supporter mesh might not be robust.

This study is the first to demonstrate the benefits of employing supporters for US tracking. Given the target and supporter position estimates from the main object tracker, learning and applying the supporter model take less than 20 ms, most of which is spent on the correlation coefficient calculation. The resulting tracking technique achieves near real-time performance at 22.5 frames per second (fps) on average. As such, it is the state of the art in the CLUST-2015 challenge for real-time tracking of liver motion in 2D ultrasound sequences: the winner of the challenge achieved only 4.8 fps on average (about 4.7 times slower), with a mean (95%ile) error only 12.5% (2.6%) better. This real-time capability is also why we based our method on the TMG tracker.

In a practical application of our method in radiation therapy, a 2D convex transducer can be used to image the liver, reaching below the ribs. In an initial (reference) frame, an operator would mark the target location (e.g., the tumor center), as well as a few other easy-to-track locations (supporters). Tracking would then run during the treatment, while the target location estimates can be used to gate the treatment or to compensate for patient motion.