1 Introduction

Coronary artery disease (CAD) is one of the most prevalent critical cardiovascular diseases, with a mortality rate of up to 32% [18]. Diagnosing CAD requires reconstructing a 3D coronary artery tree, e.g., from coronary computed tomography angiography (CCTA) images, so that the diagnosis can be based on vascular anatomical information such as annotations of vascular branches and vascular morphological properties [7]. However, conventional reconstruction methods exploit only images acquired in the diastolic phase, which reveal only part of the coronary arteries [1, 15, 22]. Vessel lesions may therefore remain invisible, which can lead to misdiagnosis.

In fact, a cardiac cycle has two phases, i.e., diastole and systole. The arteries reconstructed from the CCTA images of either phase form an incomplete coronary tree, but the two trees complement each other. By accurately aligning the arteries in both phases, the complete coronary tree can be reconstructed. Nevertheless, successful coronary reconstruction faces three challenges. 1) Since the heart beats vigorously, its surrounding arteries can be squeezed by the heart chambers and become invisible in one of the phases, which easily causes a significant number of arteries in the two-phase images to be misaligned (component variation for short), as indicated by the yellow (visible in diastole only) and cyan (visible in systole only) arrows in Fig. 1(a). 2) Arteries deform along with the heartbeat, so their shape, size, and location may vary significantly across the two phases, which makes alignment difficult, as demonstrated in Fig. 1(b). 3) Arteries are tiny tubular tissues that occupy only a very small part (\(\le 0.5\%\)) of the whole CCTA image (Fig. 1(c)), causing imbalance issues for image-based registration methods.

Fig. 1. Three challenges of coronary artery registration.

For vessel registration, there are three main branches of methods, i.e., image-based, point-cloud-based, and hybrid registration. Image-based methods utilize image features to register the entire volume, and the obtained deformation field is then used to align vessels to the target space. Those methods have been extensively applied to the registration of coronary arteries [14, 16], pulmonary vessels [13, 17], cerebral vessels [10], heart chambers [11], etc. Although those methods demonstrate promising performance at the scale of the whole image, the vessels are not necessarily well aligned, so the results cannot be used to reconstruct the complete coronary tree. By contrast, point-cloud-based registration directly aligns the vessels, which are first labeled or segmented from CCTA images and then modeled as point clouds for registration. For example, point-cloud networks [20, 21] or graph convolutional networks [24] commonly exploit geometric features of the vascular point cloud, which makes them more flexible and accurate than image-based methods. The main limitation of point-cloud-based methods is that geometric features alone cannot reliably distinguish the arteries, because different arteries or artery branches can share very similar morphology [12]. Similarly, the hybrid methods [4, 8] also extract the vessel masks in the images for registration, but the lack of effective image information limits their performance. Integrating the advantages of both domains (image and point cloud) may produce improved outcomes, but has not yet been explored.

In this paper, we propose a structural point registration network (SPR-Net) to align coronary arteries from the systolic and diastolic phases. The SPR-Net is designed to exploit both image-based and point-cloud-based features, in which the image and point cloud are encoded as intrinsic features. Additionally, we propose a transformer-based feature fusion module to fully exploit the obtained intrinsic features in extracting structural points, i.e., key points that delineate the anatomical morphology of arteries across the two phases and are solely used to compute the deformation field. Given the obtained structural points, a simple thin-plate spline (TPS) [5] method is employed to align the coronary arteries of systole and diastole. An overview of the framework is given in Fig. 2. Extensive experimental results demonstrate the superiority of our method over eight competing methods.

Fig. 2. Overview of the proposed framework consisting of four components: 1) Geometric feature learning module for point cloud; 2) Image semantic feature extraction by ViT modules; 3) Geometric and image semantic feature encoding by transformers; 4) TPS-based dense deformation field interpolation.

2 Method

We propose the SPR-Net method, which simultaneously utilizes geometric features extracted from point clouds and image features extracted from CCTA images in order to generate structural points that align arteries across systole and diastole despite variations in shape, location, and components. In this section, we first introduce the extraction of geometric features (Sect. 2.1), then the extraction of image features (Sect. 2.2), next the extraction of structural points and their usage in registration procedures (Sect. 2.3), and finally the loss function (Sect. 2.4).

2.1 Geometric Feature Learning

The coronary arteries share a tubular shape. A point cloud network can effectively learn the spatial geometric shape of arteries and provides accurate relative positional relationships between points [24], so that the obtained point features are more discriminative. Inspired by [6], we employ a point cloud encoder with the same structure as in [6], which comprises three layers (i.e., a sampling layer, a multi-scale grouping layer, and a PointNet layer), to extract the geometric features of each point.

Given the input diastolic and systolic point clouds P and Q, we first use the sampling layer in the point cloud encoder to obtain the down-sampled points \(\bar{P}=\{\bar{p}_1,\bar{p}_2,\cdots , \bar{p}_m\}\) with \(\bar{p}_i\in R^3\) and \(\bar{Q}=\{\bar{q}_1,\bar{q}_2,\cdots , \bar{q}_m\}\) with \(\bar{q}_i\in R^3\), respectively. \(\bar{P}\) and \(\bar{Q}\) are then fed into the multi-scale grouping layer, which aggregates the neighboring points of each sampled point within different radii r. After that, the multi-scale aggregated points are fed into the PointNet layer to extract geometric features.
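The sampling and multi-scale grouping steps can be illustrated with a minimal sketch. It assumes that the sampling layer uses farthest point sampling and that grouping is a ball query, as is common in PointNet++-style encoders; the text does not state either choice. The function names `farthest_point_sampling` and `multi_scale_groups` and the toy point cloud are hypothetical, and the code uses plain Python rather than a deep learning framework.

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Pick m indices so that each new point is farthest from those already chosen.

    Assumption: the paper only says "sampling layer"; FPS is a common choice.
    """
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < m:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return chosen

def multi_scale_groups(points, centers, radii):
    """For each center index and each radius r, list the indices of points inside the ball."""
    return [
        [[j for j, p in enumerate(points) if math.dist(p, points[c]) <= r] for r in radii]
        for c in centers
    ]

# Toy "vessel": 11 points on a straight segment, in normalized coordinates.
cloud = [(i / 10.0, 0.0, 0.0) for i in range(11)]
centers = farthest_point_sampling(cloud, m=3)
groups = multi_scale_groups(cloud, centers, radii=(0.1, 0.2, 0.4))
```

Each entry of `groups` holds one neighbor list per radius, so a larger radius always yields a superset of the neighbors found with a smaller one. In a real encoder, the features of each group would then be passed through the PointNet layer.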

2.2 Geometric and Image Feature Encoding

Point clouds can provide good geometric shapes and spatial location information, but they lack sufficient semantic features of coronary arteries. Meanwhile, the images contain rich contextual information that can complement the geometric features. Therefore, we design a transformer-based module to integrate both advantages. Specifically, 1) we employ a shallow 3D vision transformer (ViT) [9] to extract image features of the artery; and 2) we employ general transformers [19] to fuse image features and geometric features extracted by the point-cloud encoder.

1) Image Feature Extraction. For efficiency, we only crop image blocks of size \(h\times w \times d\), with each point as the centroid, and the ViT block is employed to extract local features. Since these blocks are extracted along the tubular structures, the extracted local features reveal intrinsic relationships. To exploit their correlations, we employ a self-attention mechanism-based transformer. The coordinates of each point serve as the position encoding, which is added to its local image feature as the input to the following transformer blocks.

$$\begin{aligned} f^{img}_i(a, b, c) = E_i(a, b, c)+ I_i(a, b, c) \end{aligned} \tag{1}$$

where \(E_i\) and \(I_i\) respectively denote the position encoding and the image features of the i-th cropped volume, \((a, b, c)\in R^3\) are the coordinates of the point at its center, \(f^{img}_i\in R^l\) is the input to the self-attention of the transformer layer, and l is the feature dimension.
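Eq. (1) can be made concrete with a small sketch. It assumes that the position encoding \(E_i\) is a linear projection of the coordinates to l dimensions; the text only states that the coordinates serve as the position encoding, so the projection, the names `position_encoding` and `transformer_input`, and the toy weights are illustrative assumptions.

```python
def position_encoding(coord, weight, bias):
    """Project a 3-D coordinate (a, b, c) to an l-dimensional vector E_i.

    Assumption: a linear map; the paper does not specify the encoding function.
    """
    return [sum(w * x for w, x in zip(row, coord)) + b for row, b in zip(weight, bias)]

def transformer_input(coord, image_feature, weight, bias):
    """Eq. (1): f_i^img = E_i + I_i, an element-wise sum of two l-dimensional vectors."""
    encoding = position_encoding(coord, weight, bias)
    if len(encoding) != len(image_feature):
        raise ValueError("position encoding and image feature must share dimension l")
    return [e + f for e, f in zip(encoding, image_feature)]

# Toy example with l = 4 (hypothetical weights and image feature).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
b = [0.0, 0.0, 0.0, 0.0]
f_img = transformer_input((0.25, 0.5, 0.125), [1.0, 1.0, 1.0, 1.0], W, b)
```

The only structural requirement illustrated here is that \(E_i\) and \(I_i\) have the same dimension l, so that they can be summed before entering the transformer blocks.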

2) Geometry and Image Co-embedding. Given the concatenation of the pointwise geometric features and the image features, four transformer layers are employed to further explore comprehensive contextual features between the two phases. Each transformer layer incorporates an encoder block and a decoder block, both based on a multi-head attention mechanism. We use the concatenated features of the diastolic phase as input to the transformer encoder and decoder respectively, and the opposite for the systolic phase, to learn the feature dependencies between the two phases.
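The cross-phase attention behind this co-embedding can be illustrated with a minimal sketch. It assumes one reading of the text, namely that the features of one phase act as queries while those of the other phase supply keys and values, after which the roles are swapped. It uses a single head and no learned projections, so it is far simpler than the four-layer module described above; `cross_attention` and the toy feature vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

# Toy fused features (2-D here; the paper uses 640-D).
diastole = [[1.0, 0.0], [0.0, 1.0]]
systole = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumed arrangement: each phase attends to the other one.
dia_ctx = cross_attention(diastole, systole, systole)
sys_ctx = cross_attention(systole, diastole, diastole)
```

Each output row is a convex combination of the other phase's features, which is how information about the opposite phase enters the representation of every point.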

2.3 Registration via Structural Point Correspondences

1) Integration of Structural Points. A shared MLP takes the contextual features extracted by the transformer as input and outputs, for each structural point, a probability over the sampled points. Specifically, given the sampled points \(\bar{P}\) with the fused features \(F_P\) from diastole, we input the features into the shared MLP to generate the probability maps \(V_p = \{ v_1,v_2,\cdots ,v_k \}\) with \(v_i\in R^m\), where k is the number of structural points and \(v_i^j\) is the weight of the sampled point \(\bar{p}_j\) for the i-th structural point. The diastolic structural points \(S_p=\{s_1^p,s_2^p,\cdots ,s_k^p\}\) are then calculated as follows:

$$\begin{aligned} s_i^p=\sum _{j=1}^m \bar{p}_j v_i^j \quad \text{with} \quad \sum _{j=1}^m v_i^j=1 \quad \text{for each } i \end{aligned} \tag{2}$$

Note that the systolic structural points \(S_q\) are calculated in the same way as the diastolic structural points.
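Eq. (2) amounts to a probability-weighted average of the sampled points. The sketch below assumes that the MLP outputs unnormalized scores which are turned into the weights \(v_i\) by a softmax over the m sampled points; the softmax is one plausible way to satisfy the sum-to-one constraint and is not stated in the text. The name `structural_points` and the toy values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax, so that the weights sum to 1 as Eq. (2) requires."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def structural_points(sampled_points, logits):
    """Eq. (2): s_i = sum_j p_j * v_i^j with v_i = softmax(logits_i).

    sampled_points: m points (x, y, z); logits: k rows of m scores (assumed MLP output).
    """
    result = []
    for row in logits:  # one row per structural point
        v = softmax(row)
        result.append(tuple(sum(w * p[axis] for w, p in zip(v, sampled_points))
                            for axis in range(3)))
    return result

# Toy example with m = 4 sampled points and k = 2 structural points.
P_bar = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
logits = [[0.0, 0.0, 0.0, 0.0],    # uniform weights: the centroid
          [50.0, 0.0, 0.0, 0.0]]   # peaked weights: close to the first point
S_p = structural_points(P_bar, logits)
```

Because every structural point is a convex combination of sampled points, it always lies inside their convex hull. A peaked distribution places it near a single sampled point, while a flat one pulls it toward the centroid.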

2) Structural Points based Registration using TPS. Based on the correspondence established between the structural points \(S_p\) and \(S_q\) in the two phases, we use the simple but effective TPS method to interpolate the dense deformation field. For the two sets of structural points, the nearest projection from \(S_q\) to \(S_p\) is calculated, and \(S_q\) is warped to \(S_p\) in the diastolic phase. Finally, each systolic point is re-meshed by the closest point to the structural point and further warped to the original points Q using the estimated dense deformation field.
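The TPS interpolation step can be sketched as follows, under several assumptions: the control points are the corresponding structural points, with \(S_q\) as source and \(S_p\) as target; the 3-D radial kernel is \(U(r)=r\); and no regularization is used. The nearest-projection and re-meshing steps are omitted. The helper names `solve`, `fit_tps`, and `apply_tps` and the toy coordinates are hypothetical, and the linear system is solved in plain Python for self-containment.

```python
import math

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(a) + list(b) for a, b in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [[M[i][n + j] / M[i][i] for j in range(len(B[0]))] for i in range(n)]

def fit_tps(src, dst):
    """Fit a 3-D thin-plate spline that maps control points src exactly onto dst."""
    k = len(src)
    A = [[0.0] * (k + 4) for _ in range(k + 4)]
    for i in range(k):
        for j in range(k):
            A[i][j] = math.dist(src[i], src[j])  # assumed kernel U(r) = r
        affine = (1.0,) + tuple(src[i])
        for c in range(4):
            A[i][k + c] = affine[c]   # affine part: 1, x, y, z
            A[k + c][i] = affine[c]   # side conditions on the kernel weights
    B = [list(d) for d in dst] + [[0.0] * 3 for _ in range(4)]
    return src, solve(A, B)

def apply_tps(model, point):
    """Evaluate the fitted spline at an arbitrary 3-D point."""
    src, coef = model
    basis = [math.dist(point, s) for s in src] + [1.0] + list(point)
    return tuple(sum(b * coef[i][axis] for i, b in enumerate(basis)) for axis in range(3))

# Toy structural points: systolic S_q (source) and diastolic S_p (target).
S_q = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
S_p = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (0.1, 1.0, 0.0), (0.1, 0.0, 1.0), (1.2, 1.1, 1.0)]
model = fit_tps(S_q, S_p)
warped = [apply_tps(model, q) for q in [(0.5, 0.5, 0.5), (0.2, 0.8, 0.1)]]
```

The fitted spline reproduces the target at every control point and interpolates smoothly in between, which is what allows a sparse set of structural points to define a dense deformation for all systolic points. At least four control points that do not lie in a common plane are needed for the system to be solvable.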

2.4 Loss Function

We design a structure-constrained registration loss for SPR-Net,

$$\begin{aligned} L_{total} = L_{rec}(S_p,P)+L_{rec}(S_q,Q)+L_{rec}(S_p,S_q) \end{aligned} \tag{3}$$

where,

$$\begin{aligned} L_{rec}(X,Y) = \frac{1}{|X|}\sum \limits _{x_i \in X}\min \limits _{y_j \in Y}{\Vert x_i-y_j\Vert ^2_2} + \frac{1}{|Y|}\sum \limits _{y_j \in Y}\min \limits _{x_i \in X}{\Vert y_j-x_i\Vert ^2_2}. \end{aligned} \tag{4}$$

Here \(L_{rec}\) is the Chamfer distance, and X and Y denote two point clouds. The first term \(L_{rec}(S_p, P)\) and the second term \(L_{rec}(S_q, Q)\) ensure that the predicted structural points in the two phases stay close to their corresponding original point clouds. The third term \(L_{rec}(S_p, S_q)\) encourages an accurate alignment of structural points between the two phases, ensuring that structural points with the same semantics align on the same vessel branch.
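Eqs. (3) and (4) translate almost directly into code. The following is a minimal brute-force sketch in plain Python; the names `chamfer` and `total_loss` and the toy point sets are illustrative, and a training implementation would use batched, differentiable tensor operations instead.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def chamfer(X, Y):
    """Eq. (4): symmetric Chamfer distance between point sets X and Y."""
    x_to_y = sum(min(squared_distance(x, y) for y in Y) for x in X) / len(X)
    y_to_x = sum(min(squared_distance(y, x) for x in X) for y in Y) / len(Y)
    return x_to_y + y_to_x

def total_loss(S_p, S_q, P, Q):
    """Eq. (3): two shape-fidelity terms plus one cross-phase alignment term."""
    return chamfer(S_p, P) + chamfer(S_q, Q) + chamfer(S_p, S_q)

# Toy check: one point against two points.
X = [(0.0, 0.0, 0.0)]
Y = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
value = chamfer(X, Y)  # 1/1 * min(1, 4) + 1/2 * (1 + 4) = 3.5
```

The distance is zero only when every point of each set coincides with some point of the other, and it does not require the two sets to have the same size, which suits the comparison of k structural points against a much larger original point cloud.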

3 Experiments and Results

3.1 Dataset and Evaluation Metrics

Data Processing. We collected 58 pairs of CCTA images, each covering both the diastolic and systolic phases. All coronary artery masks were first extracted using [25] and then refined by three experts. The annotated arteries were then down-sampled and modeled as 3D point clouds, and their coordinates were normalized to the range [0, 1]. We adopt a five-fold cross-validation strategy, with 40 training subjects and 18 testing subjects.

Evaluation Metrics. Since the artery branches of systole and diastole only partially overlap, i.e., some coronary branches only appear in one phase, we define a common Dice coefficient (CoDice) to accurately evaluate the results.

$$\begin{aligned} \text{CoDice}(P_o, Q_o)=\frac{2|P_o \cap Q_o|}{|P_o|+|Q_o|} \end{aligned} \tag{5}$$

where \(P_o\) and \(Q_o\) denote the sets of coronary branches, in the diastolic and systolic phases respectively, that are common to both phases. Moreover, the Dice coefficient (Dice), Chamfer distance (CD), and Hausdorff distance (HD) are also employed for evaluation.
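The difference between CoDice and the ordinary Dice coefficient can be shown with a small sketch. It assumes that each phase is represented as a mapping from a branch label to the set of voxel indices occupied by that branch after registration; this representation, the function names `dice` and `co_dice`, the branch labels, and the convention of returning 0.0 for two empty sets are all illustrative assumptions.

```python
def dice(A, B):
    """Dice coefficient of two sets; 0.0 for two empty sets (an arbitrary convention)."""
    if not A and not B:
        return 0.0
    return 2.0 * len(A & B) / (len(A) + len(B))

def co_dice(dia_branches, sys_branches):
    """Eq. (5): Dice restricted to branches that appear in both phases.

    dia_branches / sys_branches: dict mapping branch label -> set of voxel indices.
    """
    common = dia_branches.keys() & sys_branches.keys()
    P_o = set().union(*(dia_branches[b] for b in common))
    Q_o = set().union(*(sys_branches[b] for b in common))
    return dice(P_o, Q_o)

# Toy example: one shared branch plus one phase-specific branch per phase.
diastole = {"LAD": {(0, 0, 0), (1, 0, 0), (2, 0, 0)}, "RCA": {(5, 5, 5)}}
systole = {"LAD": {(1, 0, 0), (2, 0, 0), (3, 0, 0)}, "LCX": {(9, 9, 9)}}

common_score = co_dice(diastole, systole)
plain_score = dice(set().union(*diastole.values()), set().union(*systole.values()))
```

In this toy case the plain Dice is 0.5 while CoDice is about 0.67, because branches visible in only one phase cannot overlap with anything and would otherwise penalize a registration that is correct on the shared branches.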

3.2 Implementation Details

The initial input of the SPR-Net contains 4096 points for each phase, and a volume of size \(16 \times 16 \times 8\) is cropped around each point. The point cloud encoder consists of two set abstraction blocks with 1024 and 256 grouping centers, respectively. In each set abstraction block, we utilize the multi-scale grouping layer to combine features aggregated at several radii r, namely (0.1, 0.2, 0.4) and (0.2, 0.4, 0.8) for the two blocks, respectively. The transformer blocks are composed of vanilla transformer layers. The outputs of the point cloud encoder and the ViT are 512-D and 128-D features, respectively, which are concatenated to form 640-D contextual features. The configuration of the MLP block in the structural point integration depends on the number of structural points. All experiments were implemented in PyTorch on one NVIDIA Tesla A100 GPU. We trained the networks using the Adam optimizer with an initial learning rate of \(10^{-4}\) for 600 epochs with a batch size of 8.
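For reference, the hyperparameters listed above can be gathered into a single configuration object. This is only a convenience sketch: the class `SPRNetConfig` and all field names are hypothetical, and the default of 768 structural points is taken from the setting reported with Fig. 3 rather than from this paragraph.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SPRNetConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    points_per_phase: int = 4096
    patch_size: tuple = (16, 16, 8)            # h x w x d volume cropped around each point
    grouping_centers: tuple = (1024, 256)      # one entry per set abstraction block
    grouping_radii: tuple = ((0.1, 0.2, 0.4), (0.2, 0.4, 0.8))
    point_feature_dim: int = 512
    image_feature_dim: int = 128
    num_structural_points: int = 768           # K used for the results shown in Fig. 3
    learning_rate: float = 1e-4
    epochs: int = 600
    batch_size: int = 8

    @property
    def fused_feature_dim(self):
        """Dimension of the concatenated geometric and image features."""
        return self.point_feature_dim + self.image_feature_dim

config = SPRNetConfig()
```

Keeping the two feature dimensions as separate fields makes the 640-D size of the fused features a derived quantity instead of a value that has to be kept in sync by hand.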

3.3 Comparison with State-of-the-Art Methods

Our SPR-Net was evaluated quantitatively and qualitatively against eight SOTA registration methods, which belong to three categories: 1) image-based registration, including SyN [2], VoxelMorph [3], and DiffuseMorph [11]; 2) hybrid registration, i.e., TMM [8]; and 3) point-cloud-based registration, including Go-ICP [23], DCP [20], STORM [21], and ISRP [6].

Quantitative Results. The quantitative results are listed in Table 1. Point-cloud-based methods outperform image-based methods, which supports the earlier observation about the limitation of image-based methods. Our proposed method significantly outperforms the other methods, since SPR-Net fully encodes and fuses features of the images and point clouds. Notably, SPR-Net achieves significantly better performance than ISRP, the closest competing method, improving Dice by about 10 percentage points (from 58.31% to 68.58%).

Table 1. Results of comparison experiments.
Table 2. Quantitative results of ablation analysis of different components.

Qualitative Visualization. Since the correspondence of structural points is vital for registration, we show the structural points (colored) in systole (green) and diastole (red) in Fig. 3 to demonstrate their correspondence. Structural points that correspond to the same vascular branch are marked by the same color, as highlighted by the dashed boxes in the 2nd column of Fig. 3. Notably, when the number of structural points is small, they are distributed at positions such as endpoints or bifurcation points, as shown in the 1st and 2nd columns, which properly delineates the morphology of the point clouds. As the number increases, structural points are located not only at endpoints or bifurcation points but also spread along the vessel branches, forming the vessel skeleton, as shown in the 3rd and 4th columns of Fig. 3. In the 5th column of Fig. 3, a complete coronary tree is obtained by exploiting the registration (\(K=768\)).

Fig. 3. Structural points and registration results of two subjects (a and b). Point clouds in green and red denote the systolic and diastolic phases, respectively. From the 1st to the 4th column, the number of generated structural points increases, and the colors of the structural points denote the correspondence across the two phases. The last column shows the results before and after registration according to the correspondence of the structural points.

3.4 Ablation Study

We also conduct ablation studies with the same backbone point cloud encoder under three groups of configurations: 1) whether to use the four transformer layers, denoted as CoF, to encode and fuse the systolic and diastolic geometry; 2) whether to fuse the geometric features of the point cloud with image-level semantic features, denoted as GIF; and 3) the number of structural points (Number-SP). Table 2 summarizes the ablation study results.

Without CoF and GIF, only the backbone encoder is used to generate structural points. 1) With the same 768 structural points, the individual CoF and GIF modules both improve the Dice performance. Meanwhile, combining the two modules leads to the best performance, which may suggest the importance of fusing the two different aspects of features. 2) With both CoF and GIF, SPR-Net's performance improves when the number of structural points increases from 256 to 768. However, the performance decreases when the number is further increased to 1024, indicating that overly dense structural points negatively affect the results, which is probably caused by the increasing number of outlier points. SPR-Net also performs worse than both backbone+CoF and backbone+GIF when using 256 structural points. This is probably because such sparse structural points are largely located at the endpoints and bifurcation positions and therefore cannot delineate the morphology of the vessel tree well. Therefore, the number of structural points is a key parameter that affects registration performance. Through extensive experiments, we determine the optimal number of structural points to ensure one-to-one correspondences between diastole and systole (Table 2).

4 Conclusion

In this paper, we have proposed an intrinsic structural point learning-based framework for systolic and diastolic coronary artery registration. The framework identifies structural points in the arteries across the two different phases using both the spatial geometric features extracted by the point cloud network and the complementary image semantic information extracted by ViT. By strategically fusing the image and point geometric features through a transformer, structural points with strong correlations in two different phases are extracted and used to guide the registration process. Compared with the existing image-based registration methods and point cloud-based methods, our integrated method achieves superior performance and outperforms the state-of-the-art methods by a large margin, which suggests the potential applicability of our framework in real-world clinical scenarios for CAD diagnosis.