1 Introduction

Anatomical landmark detection plays an important supporting role in many medical image analysis tasks, such as organ segmentation, registration, and vessel extraction [1]. However, accurate landmark detection still faces several challenges: (a) anatomy varies widely between patients; (b) when multiple landmarks are detected simultaneously, the spatial constraints among them should be taken into account; (c) detecting landmarks in 3D greatly increases the computational cost, making real-time application difficult; and (d) the limited amount of annotated training data restricts algorithm design. Although many methods have been proposed [2,3,4,5], there is still room for improvement. Among these methods, ours is most closely related to [3, 4].

For landmark detection, an intuitive patch-based approach is to regress the displacements from patch centers to the target landmark [3]. The landmark position is then computed from these displacements by a majority or average voting strategy. Because numerous patches are available for training, it is possible to design deep networks that capture discriminative information and outperform shallow ones. Nonetheless, these methods focus only on local appearance, and global information is not well utilized. The large number of patches also leads to a heavy computational burden. As an improvement, Noothout et al. [6] proposed a model that performs classification and regression jointly, in which only the displacements of patches classified as containing landmarks contribute to the final result.
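The voting step can be sketched as follows. This is a minimal illustration, assuming that each patch votes for one candidate position (its center plus its regressed displacement) and that the votes are combined by a plain average; the function name `vote_landmark` and the toy coordinates are hypothetical and not taken from [3].

```python
def vote_landmark(centers, displacements):
    """Average the landmark candidates voted for by a set of patches.

    centers       -- list of patch-center coordinates, e.g. [(x, y, z), ...]
    displacements -- regressed offsets from each center to the landmark
    """
    if not centers or len(centers) != len(displacements):
        raise ValueError("need one displacement per patch center")
    # Each patch votes for the position: center + displacement.
    votes = [tuple(c + d for c, d in zip(center, disp))
             for center, disp in zip(centers, displacements)]
    n = len(votes)
    # Average voting: component-wise mean of all votes.
    return tuple(sum(v[a] for v in votes) / n for a in range(len(votes[0])))


# Toy example with three patches (hypothetical numbers).
centers = [(10, 10, 10), (20, 10, 10), (10, 20, 10)]
displacements = [(5, 5, 0), (-4, 5, 0), (5, -6, 0)]
landmark = vote_landmark(centers, displacements)
```

A majority vote would instead accumulate the votes in a histogram over voxels and pick the most frequent one.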

Another line of work is based on regressing heatmaps [4]. Taking the entire image as input, these models output a synthetic heatmap that denotes the probability of each voxel belonging to the target landmark. The predicted position is simply the output voxel with the maximum temperature. Such models can utilize global context information and have good spatial generalization. However, in methods using an FCN [7] the output is smaller than the input volume, which imposes a theoretical lower bound on the prediction error. For instance, an output heatmap of size 128 for an input of size 512 can lead to an error of up to 3 voxels. Furthermore, the total number of network weights grows sharply for 3D medical images, making training difficult with the limited training data at hand.
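The 3-voxel figure can be checked with a short calculation. The sketch below assumes that an output index i is mapped back to the input coordinate i × stride (the first voxel of each block); this mapping convention is an assumption, and mapping to block centers would give a smaller bound. The function name is hypothetical.

```python
def max_quantization_error(input_size, output_size):
    """Largest per-axis error (in input voxels) caused by heatmap downsampling.

    ASSUMPTION: an output index i is mapped back to input coordinate i * stride.
    """
    if input_size % output_size != 0:
        raise ValueError("input_size must be a multiple of output_size")
    stride = input_size // output_size
    worst = 0
    for x in range(input_size):
        recovered = (x // stride) * stride  # closest reachable coordinate below x
        worst = max(worst, x - recovered)
    return worst


worst_case = max_quantization_error(512, 128)  # stride 4
```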

Combining the advantages of the two approaches above, we propose a cascade regression model that couples heatmap regression with displacement regression. The proposed method makes coarse-to-fine predictions, taking the entire image at lower resolution and patches at higher resolution as input in the two stages, respectively, and thereby combines global information with local appearance. The spatial relationships among landmarks are also taken into account by learning long-range context, which improves overall performance. The cascade structure is similar to the method of He et al. [9], in which facial landmark localizations were refined through progressively finer modeling. In contrast, our method uses a carefully designed heatmap regression model instead of a deep CNN to make the initial prediction. In addition, the subsequent stage takes local patches as input [10], rather than the entire image as in [9].

We evaluated our method on coronary and aorta CTA images by detecting 5 and 9 anatomical landmarks, respectively. These landmarks are of great clinical significance: cardiac landmarks contribute to the diagnosis, prognosis, and therapy of cardiovascular diseases [1], and the detection of aortic landmarks is an effective aid in aortic vascular modeling [6]. The results demonstrate that our method is well suited to the cardiac and aortic landmark detection tasks and achieves performance comparable to the state-of-the-art approach [6].

Fig. 1. The overview of our cascade regression model.

2 Proposed Method

Figure 1 illustrates the overall cascade regression framework for single landmark detection. We show the 2D case for clarity, but the model works similarly in 3D. In the first stage, a modified U-Net produces a relatively accurate initial localization, taking the entire image at lower resolution as input and producing heatmaps as output. Owing to its skip architecture, this module can capture multi-scale information. In the second stage, to learn more precise context information, a higher-resolution patch centered at the initial localization is extracted and fed to the displacement regression model. This CNN adjusts the initial localization by moving it toward the ground truth position. The different input sizes and resolutions of the two stages reflect their respective focus on long-range context and local appearance.
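The two-stage inference flow can be sketched in 2D as follows. This is only an illustration under stated assumptions: the heatmap is given as a nested list, the coarse position is mapped to the original resolution by multiplying with the downsampling factor, and the `refine` callable is a hypothetical stand-in for the trained second-stage CNN.

```python
def argmax_2d(heatmap):
    """Return (row, col) of the maximum response in a 2D heatmap."""
    best, best_pos = float("-inf"), (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best:
                best, best_pos = value, (r, c)
    return best_pos


def cascade_predict(heatmap, factor, refine):
    """Two-stage inference: coarse heatmap argmax, then displacement refinement.

    heatmap -- low-resolution heatmap from the first stage
    factor  -- downsampling factor between original and low resolution
    refine  -- callable mapping an initial position to a displacement
               (hypothetical stand-in for the second-stage CNN)
    """
    coarse = argmax_2d(heatmap)
    # Map the coarse position back to original-resolution coordinates.
    initial = tuple(i * factor for i in coarse)
    delta = refine(initial)
    return tuple(p + d for p, d in zip(initial, delta))


heatmap = [[0.0, 0.1, 0.0],
           [0.1, 0.2, 0.9],
           [0.0, 0.1, 0.3]]
# A fixed displacement replaces the network output in this toy example.
prediction = cascade_predict(heatmap, factor=4, refine=lambda s0: (1.5, -2.0))
```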

2.1 Primary Prediction

We use heatmap regression to make the first-stage prediction. In this scheme, each landmark has a separate output channel in which a Gaussian heat spot is centered at its location. During inference, the predicted position is simply the location of the maximum response. Following the principle of classification, for \(N_l\) landmarks the model is trained with \(N_l+1\) channels, where the first \(N_l\) channels describe the probability of belonging to the corresponding landmark and the last channel describes the probability of belonging to the background. Applying a softmax operation would distort the values at landmark positions in the ground truth heatmap: for 5 landmarks, the values of the 1st landmark across the 6 channels change from (1, 0, ..., 0) to (0.35, 0.13, ..., 0.13) after softmax, so the peak can become smaller than its neighbors. We therefore normalize the sum of all channels to 1 by fixing the background channel and scaling the others.
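The softmax figures quoted above can be reproduced with a few lines of standard-library Python. This is only a numerical check of the example, not part of the model; the helper name `softmax` is our own.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Ground-truth values at the position of the 1st of 5 landmarks:
# 5 landmark channels plus 1 background channel.
target = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
after = softmax(target)
rounded = [round(p, 2) for p in after]
```

Rounded to two decimals, the result matches the values given in the text, which shows how strongly softmax flattens the peak of the heat spot.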

The temperature \(t_i\) of the i-th landmark (i.e., the i-th channel) at voxel v can be defined as:

$$\begin{aligned} t_i(v)= \left\{ \begin{array}{ll} k\exp \left( -\frac{\Vert v-p_i\Vert ^2}{2\sigma ^2}\right) , & i=1,2,\ldots ,N_l,\\ 1-k\exp \left( -\frac{\Vert v-p_{closest}\Vert ^2}{2\sigma ^2}\right) , & i=N_l+1. \end{array} \right. \end{aligned}$$
(1)

The heatmaps of the first \(N_l\) channels are determined by the distance from the voxel v to the landmark position \(p_i\), while the heatmap of the background channel is determined by the distance to the closest landmark position \(p_{closest}\). Here \(\sigma \) is the standard deviation and k is the height of the Gaussian.
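A minimal sketch of Eq. (1) for a single voxel is given below. It assumes that the distance is the Euclidean distance in voxel coordinates; the function name `heat_targets`, the default k = 1, and the toy landmark positions are hypothetical.

```python
import math


def heat_targets(v, landmarks, sigma, k=1.0):
    """Target temperatures of voxel v for N_l landmark channels + background.

    Follows Eq. (1): a Gaussian of height k and standard deviation sigma
    around each landmark, and 1 minus the Gaussian around the closest
    landmark for the background channel.
    """
    sq_dists = [sum((a - b) ** 2 for a, b in zip(v, p)) for p in landmarks]
    temps = [k * math.exp(-d / (2.0 * sigma ** 2)) for d in sq_dists]
    background = 1.0 - k * math.exp(-min(sq_dists) / (2.0 * sigma ** 2))
    return temps + [background]


landmarks = [(10, 10, 10), (30, 30, 30)]   # toy landmark positions
at_landmark = heat_targets((10, 10, 10), landmarks, sigma=2.0)
far_away = heat_targets((20, 20, 20), landmarks, sigma=2.0)
```

At a landmark position the corresponding channel equals k and the background channel equals 1 - k; far from all landmarks the background channel approaches 1.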

As shown in Fig. 2, our model realizes this scheme by customizing the original 3D U-Net [8]. Like the standard version, the network comprises 3D convolution, max-pooling, and deconvolution (upsampling) layers, with shortcut connections from layers in the contracting path to layers of equal resolution in the expansive path. Each convolution layer uses 'same' padding (i.e., the output has the same size as the input) and a ReLU activation function. The model takes the entire downsampled image as input and outputs heatmap volumes. Thanks to the U-Net architecture, the model can capture long-range context information, so the spatial relationships among landmarks are also taken into account, which increases overall accuracy.

To tackle the class imbalance problem, namely that the heat spot occupies only a small proportion of the volume, we employ a weighted mean squared error (MSE) loss between the predicted and ground truth heatmaps. The weights are chosen to be the exponential powers of the predicted values in the output. In addition, to deal with the vanishing gradient problem, we shorten the backpropagation path of the gradient flow by incorporating three side-path auxiliary losses. The final loss function is expressed as:

$$\begin{aligned} \mathcal {L}(P;H^{GT})=\mathcal {L}_{mse}(P;H^{GT})+\sum _{s=1,2,3}\beta _s\mathcal {L}_{mse}^s(p^s;H^{GT}) \end{aligned}$$
(2)

where \(H^{GT}\) is the ground truth heatmap, P is the final output, and \(\beta _s\) is the weight of the side-path output \(p^s\), set to 0.3, 0.6, and 0.9 for s = 1, 2, 3, respectively.
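The structure of Eq. (2) can be sketched on flattened heatmaps as follows. Two points are assumptions rather than details given in the text: each voxel is weighted by exp(predicted value), which is only one possible reading of "exponential powers of the predicted values", and the side-path outputs are taken to be already resampled to the size of the ground truth. The function names are hypothetical.

```python
import math


def weighted_mse(pred, target):
    """Weighted MSE over flattened heatmaps.

    ASSUMPTION: each voxel is weighted by exp(predicted value); the text only
    states that the weights are exponential powers of the predicted values.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = sum(math.exp(p) * (p - t) ** 2 for p, t in zip(pred, target))
    return total / len(pred)


def total_loss(final, sides, target, betas=(0.3, 0.6, 0.9)):
    """Eq. (2): main loss plus beta-weighted auxiliary side-path losses.

    ASSUMPTION: side outputs are already resampled to the target's size.
    """
    loss = weighted_mse(final, target)
    for beta, side in zip(betas, sides):
        loss += beta * weighted_mse(side, target)
    return loss


target = [0.0, 1.0, 0.0, 0.0]
final = [0.0, 1.0, 0.0, 0.0]            # perfect final output
sides = [[0.0, 0.0, 0.0, 0.0]] * 3      # side paths that miss the heat spot
loss = total_loss(final, sides, target)
```

In this toy case the main term vanishes and only the three side paths contribute, each with a weighted MSE of 0.25 scaled by its \(\beta _s\).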

Fig. 2. The architecture of the proposed model in the first stage.

2.2 Refinement Strategy

In the second stage, we propose a CNN model to refine the primary prediction. Because the first-stage model takes the entire image as input, we assume that the true landmark lies near the initial prediction. The CNN takes patches at the original resolution, centered at the initial prediction, to capture more precise local information. Since the local appearance of certain landmarks may be ambiguous (e.g., locally similar vascular structures), we restrict the second-stage model to changing the initial prediction only within a small range.

The CNN is trained to predict the displacement vector \(\triangle S\) from the primary prediction \(S_0\) to the true landmark position \(S^{GT}\). Given a volume V, a training sample is represented by \((\varGamma (V,q), \triangle S^{GT})\), where q is a point randomly sampled from V within a small range around \(S_0\) and \(\varGamma (V,q)\) is its associated patch. The ground truth displacement vector \(\triangle S^{GT}\) is given by \(\triangle S^{GT}=S^{GT}-S_0\). During inference, the patch \(\varGamma (V,S_0)\) is fed to the model and the final prediction is obtained as \(S=S_0+\triangle S\). The CNN is trained by minimising the Euclidean loss between the predicted and true displacement vectors.
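The inference step of this stage can be sketched as a patch crop followed by the update \(S=S_0+\triangle S\). In the sketch below the volume is a nested list indexed as (z, y, x), voxels outside the volume are zero-padded (an assumption, since border handling is not specified), and the displacement is a fixed toy value standing in for the network output. All function names are hypothetical.

```python
def extract_patch(volume, center, size):
    """Crop a cubic patch of edge `size` around `center` given as (z, y, x).

    ASSUMPTION: voxels falling outside the volume are zero-padded.
    """
    half = size // 2
    cz, cy, cx = center
    dz, dy, dx = len(volume), len(volume[0]), len(volume[0][0])
    patch = []
    for z in range(cz - half, cz - half + size):
        plane = []
        for y in range(cy - half, cy - half + size):
            row = []
            for x in range(cx - half, cx - half + size):
                inside = 0 <= z < dz and 0 <= y < dy and 0 <= x < dx
                row.append(volume[z][y][x] if inside else 0.0)
            plane.append(row)
        patch.append(plane)
    return patch


def apply_displacement(s0, delta):
    """Final prediction S = S_0 + delta S."""
    return tuple(s + d for s, d in zip(s0, delta))


# Toy 4x4x4 volume whose voxel value encodes its own index.
volume = [[[z * 16 + y * 4 + x for x in range(4)] for y in range(4)]
          for z in range(4)]
s0 = (0, 1, 2)
patch = extract_patch(volume, center=s0, size=2)
final = apply_displacement(s0, (0.4, -0.3, 1.0))  # toy displacement
```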

As shown in Fig. 3, the CNN model contains 4 convolutional layers, each followed by a max-pooling layer, and 2 fully-connected layers. Each layer except the last one uses a ReLU activation function. Because certain landmarks may have an appearance distinct from the others (e.g., the apex cordis), we refine them separately; that is, we train one refinement network per landmark. Since the CNN is trained on patches, a small amount of training data is sufficient in this stage.

Fig. 3. The architecture of the displacement regression module in the second stage.

3 Experiments and Results

3.1 Data and Experiment Settings

We evaluated the proposed method on two datasets of coronary and aorta CTA images. As shown in Fig. 4, 5 cardiac landmarks and 9 aortic landmarks were annotated manually by an expert. For both datasets, we did not apply data augmentation such as scaling and rotation, which may increase the complexity of the landmark distribution.

The coronary dataset was randomly divided into 75 training scans and 40 test scans. All volumes were zero-padded to 512 \(\times \) 512 \(\times \) 512 voxels with an isotropic voxel size of 0.4 mm. They were then downsampled by a factor of 4 and fed into the first-stage model. In the second stage, patches of size 64 at the original resolution were extracted and the batch size was set to 4. The models were trained using Adam with a learning rate of 0.001 for 11,250 and 45,000 iterations in the two stages, respectively.
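The geometry implied by these settings can be worked out directly. The sketch below assumes that downsampling by a factor of 4 reduces each axis from 512 to 128 voxels and scales the voxel spacing accordingly; the helper name is hypothetical.

```python
def first_stage_geometry(size, spacing_mm, factor):
    """Grid size and voxel spacing after downsampling by `factor`."""
    if size % factor != 0:
        raise ValueError("size must be divisible by factor")
    return size // factor, spacing_mm * factor


low_size, low_spacing = first_stage_geometry(512, 0.4, 4)
patch_extent_mm = 64 * 0.4        # physical edge length of a second-stage patch
volume_extent_mm = 512 * 0.4      # physical edge length of the padded volume
```

Under these assumptions the first stage sees a 128-voxel grid at 1.6 mm spacing covering about 204.8 mm per axis, while each second-stage patch covers about 25.6 mm per axis at 0.4 mm spacing.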

The aorta dataset consists of 25 training scans and 23 test scans, with an average size of 512 \(\times \) 512 \(\times \) 777 voxels and a voxel size of 0.71\(\times \)0.71\(\times \)0.81 mm\(^3\). The annotated landmarks are located at the bifurcations of the aorta and its main branches. Because aortic landmark detection is more challenging due to the low resolution and complex organ distribution, the volumes were first manually cropped and then downsampled by a factor of 2 before being fed into the first-stage model. The rest of the training process is similar to that used for the coronary dataset.

Fig. 4. Landmarks defined on the coronary and aorta CTA images.

3.2 Results

Summary metrics obtained by the different networks on the coronary dataset are listed in Table 1. We use the average Euclidean distance between the ground truth and estimated landmark positions as the evaluation measure. We first compared the two-stage cascade model with the first-stage model alone. After refinement, the detection accuracy improves significantly, demonstrating the benefit of our cascade architecture.
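The evaluation measure can be sketched as follows, assuming that positions are given in voxel coordinates and converted to millimetres with the voxel spacing of the scan; the function name and the toy coordinates are hypothetical.

```python
import math


def mean_distance_mm(predictions, ground_truths, spacing_mm):
    """Average Euclidean distance (mm) between predicted and true positions.

    predictions, ground_truths -- lists of voxel coordinates
    spacing_mm                 -- voxel size along each axis, in mm
    """
    if not predictions or len(predictions) != len(ground_truths):
        raise ValueError("need matching, non-empty coordinate lists")
    total = 0.0
    for pred, gt in zip(predictions, ground_truths):
        total += math.sqrt(sum(((p - g) * s) ** 2
                               for p, g, s in zip(pred, gt, spacing_mm)))
    return total / len(predictions)


# Two toy detections on a scan with 0.4 mm isotropic voxels.
preds = [(103, 200, 50), (10, 10, 10)]
gts = [(100, 204, 50), (10, 10, 10)]
error = mean_distance_mm(preds, gts, spacing_mm=(0.4, 0.4, 0.4))
```

Here the first detection is off by (3, -4, 0) voxels, i.e. 2.0 mm, and the second is exact, so the average error is about 1.0 mm.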

To demonstrate that integrating the spatial relationships among landmarks can improve overall performance, we modified the first-stage model to take patches of size 48 as input instead of the entire image. In this way, the network can only use the context information around one landmark at a time. It was trained to predict the heatmap patch corresponding to the input, and the predicted position was determined by the maximum response in the volume composed of the predicted patches. The experimental results show that our first-stage method performs better overall. Specifically, the patch-based network is superior in detecting the left coronary ostium and the origin of the non-coronary aortic valve commissure, which may depend more on precise context information. On the other hand, our proposed model performs much better in detecting the right coronary ostium and the bifurcation of the LM, where the relationships among landmarks are probably necessary for accurate detection (e.g., the position of the left coronary ostium is important for localizing the bifurcation of the LM).

Table 1. Average Euclidean distance errors, in mm, for the detection of 5 cardiac landmarks on the coronary dataset. The results are obtained by the two-stage model and by the first-stage model alone, taking either patches or the entire image as input, and are compared with the algorithm of Noothout et al. [6].

Table 2. Mean and standard deviation of the Euclidean distance errors, in mm, for the detection of 14 landmarks on the two datasets by the proposed algorithm.

Furthermore, we compared our model with the method of Noothout et al. [6], which detected 6 anatomical landmarks in cardiac CT scans (4 of which coincide with ours). The metrics are quoted directly from [6] since that dataset is not publicly available. Although our dataset differs from that in [6], the results suggest that the performance of the proposed algorithm is at least comparable to [6].

Table 2 lists more detailed detection metrics for each landmark on the two datasets using our algorithm. The higher detection error for aortic landmarks is attributed to the low resolution of the aortic images. The model design does not use any atlas information specific to the coronary arteries or the aorta, which makes the method applicable to anatomical landmark detection tasks in different regions of the human body. Some visual results are shown in Fig. 5.

Fig. 5. Visualisation of landmark detection in coronary and aorta images by the cascade regression model. The ground truth and predictions are indicated by green and red dots, respectively.

4 Conclusion

We have proposed a two-stage cascade regression model for detecting anatomical landmarks in coronary and aorta CTA images. Owing to the different input sizes and resolutions of the two stages, the model combines global information with local appearance. By learning long-range context, it also takes the spatial relationships among landmarks into account, which increases overall performance. The experimental results demonstrate that our method achieves performance comparable to the state-of-the-art algorithm [6]. Limited by memory and computation time, we used downsampled images in the first stage; the model would be expected to perform better with higher-resolution input. Another limitation is that we had only one annotator, which makes it impossible to assess the inter-observer error for the landmarks. It would also be worthwhile to apply multi-stage refinement to capture more precise information. Because the method does not rely on anatomy-specific information, it is generic for anatomical landmark detection, and the next step is to extend it to other medical images.