
1 Introduction

The temporomandibular joint (TMJ) is a synovial joint containing an articular disc (Fig. 1a), which allows both hinge and sliding movements [1]. Temporomandibular disorder (TMD) is an umbrella term covering pain and dysfunction of the muscles of mastication (the muscles that move the jaw) and of the TMJ. TMD is common in adults: as many as one third report one or more symptoms, including jaw or neck pain, headache, and a clicking or grating sound within the joint. Although TMD is not life-threatening, it can be detrimental to quality of life because its symptoms can become chronic and difficult to manage. In addition to the observer's expertise, clear image information is a substantial factor in the correct diagnosis of this intractable disease [2]. The articular disc is best depicted on magnetic resonance imaging (MRI), whereas osseous surfaces are best seen on cone beam CT (CBCT). The fused MRI-CBCT image (Fig. 1b) provides complementary information on the articular disc and condyle surfaces for optimal diagnosis. The registration process used to generate fused images has been shown to be accurate and reliable in TMD assessment [3].

Unlike single-modality registration, multi-modality image registration between MRI and CBCT is challenging due to significant differences in voxel size, pixel intensity, anatomical structure identification, image orientation, and field-of-view (Fig. 1c–e). Only a few articles on MRI and CT (CBCT) image registration for TMJ visualization or assessment have been published within the last 7 years [4]. Lin was the first to explore 3D rendering of the mandible from MRI and CT images registered with 12 fiducial markers attached to the facial skin surface [5]. In a brief clinical report, Dai chose one sagittal slice of TMJ MRI and CT images from a previous study as an example to illustrate a hybrid TMJ image created in Photoshop® [6]. Al-Saleh published the first study that employed MRI-CBCT registered images to assess the diagnostic reliability of TMJ pathology [3]; they evaluated the quality of two registration techniques, extrinsic (fiducial marker-based) versus intrinsic (voxel-value mutual-information-based), in 20 TMJ images. In a recent report, Ma, one author of this article, imported DICOM data of CT/CBCT and MRI into Amira® to perform automatic/semi-automatic registration of multi-modality images by adjusting the registration parameters [7].

Fig. 1. Anatomical structure and image registration of TMJ. (a) Anatomical structure, (b) fused MRI-CBCT image, (c) MRI image, (d) CBCT image, and (e) registered image. The large field-of-view differences between the modalities are apparent.

Related Work:

Deep learning methods have shown strong advantages in medical image registration [8]. To evaluate the posture and position of an implant during surgery, Miao proposed a hierarchical regression model that estimates six transformation parameters for real-time 2D/3D registration, in which ground-truth data is synthesized by transforming aligned data [9]. Chee proposed a self-supervised affine image registration network (AIRNet) for 3D medical images that directly estimates the transformation parameters between two input images, using a synthetic dataset to train the model [10]. The difficulty of acquiring reliable ground truth has motivated research groups to explore unsupervised approaches to estimating registration transformations [8]. Kori proposed an unsupervised framework for multi-modality MRI affine registration: a pre-trained VGG-19 is used for feature extraction, followed by a key-point detector, and the key points are fed to a multi-layer perceptron (MLP) regression module that estimates the affine transformation parameters, trained on a generated set of random data points [11]. To register arbitrarily oriented reconstructed images of fetuses scanned in utero over a wide gestational age range to a standard atlas space, Salehi proposed regression CNNs that predict the angle-axis representation of 3D rotations and translations from image features, comparing mean square error and geodesic loss for training 3D pose estimation in slice-to-volume and volume-to-volume registration [12]. Combining an unsupervised network, a coarse-to-fine multi-scale iterative framework, and image deformation, Shu proposed an unsupervised network for microscopic image rigid registration that optimizes its parameters directly by minimizing the mean square error between the registered and reference images, without ground truth [13]. As far as we know, no prior work addresses the problem of differing fields-of-view in multi-modality medical image registration, which poses difficulties for current learning-based registration methods.

Contribution:

The main contributions of this work are summarized as follows: (1) A landmark-guided mechanism is introduced to effectively register TMJ MRI-CBCT images with largely different fields-of-view, without any prior assumption on the image pairs. (2) In contrast to affine-matrix learning methods for rigid image registration, our image spatial transform regression network predicts a true rigid transformation for multi-modality images.

2 Method

2.1 Overall Framework

Fig. 2. The workflow of our landmark-guided rigid registration framework applied to TMJ MRI-CBCT images with different fields-of-view. We highlight that all registrations are done in 3D; for clarity and simplicity, the figure depicts the 2D formulation of our method.

As shown in Fig. 2, our overall framework for MRI-CBCT image registration comprises three stages. First, the landmark numerical coordinate regression network takes IMRI and ICBCT as input and estimates the landmark coordinates LMRI(x, y, z) and LCBCT(x, y, z), respectively. The spatial transform network then regresses the rigid transformation matrix Tθ between the two image patches PMRI and PCBCT centered at the landmarks. Finally, by combining the rigid transformation matrix \(T_{\theta }^{*}\) with the landmark-guided information, rigid registration of the MRI-CBCT images is achieved.

2.2 Landmark Localization Network

Inspired by landmark localization in human pose estimation [14], we propose an end-to-end landmark localization network for 3D medical images (L2Net) that converts heat-map regression into a coordinate regression task. L2Net consists of a feature extraction network (U-Net) and a coordinate regression layer (Fig. 3). For further technical details, see our previous work [15].

Fig. 3. Architecture of L2Net for 3D medical images: a feature extraction network that extracts modality-independent features as an implicit heat-map H(I; ω), and a coordinate regression layer that maps H(I; ω) to the landmark coordinate L(x, y, z).

Given an MRI/CBCT image I of size v = m × n × k, the U-Net learns image features and outputs an implicit normalized heat-map H(I; ω) of the same size. Taking the probabilistic interpretation of H(I; ω), we can represent the landmark coordinate L(x, y, z) as the center of mass (centroid):

$$ L(x,y,z) = \frac{\sum\limits_{(x,y,z) \in v} (x,y,z) \cdot H(I;\omega )}{\sum\limits_{v} H(I;\omega )} $$
(1)
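For concreteness, here is a minimal PyTorch sketch of this differentiable centroid (soft-argmax) computation for a batch of 3D heat-maps; the tensor layout and function name are our assumptions, not from the paper:

```python
import torch

def soft_argmax_3d(heatmap):
    """Differentiable centroid (Eq. 1) of a 3D heat-map.

    heatmap: non-negative tensor of shape (B, D, H, W).
    Returns landmark coordinates of shape (B, 3) as (x, y, z) voxel indices.
    """
    b, d, h, w = heatmap.shape
    # Normalize so each heat-map sums to 1 (the denominator in Eq. 1).
    probs = heatmap.flatten(1)
    probs = probs / probs.sum(dim=1, keepdim=True)
    probs = probs.view(b, d, h, w)

    # Coordinate values along each axis.
    zs = torch.arange(d, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)

    # Expected value of each coordinate under the heat-map distribution.
    z = (probs.sum(dim=(2, 3)) * zs).sum(dim=1)
    y = (probs.sum(dim=(1, 3)) * ys).sum(dim=1)
    x = (probs.sum(dim=(1, 2)) * xs).sum(dim=1)
    return torch.stack([x, y, z], dim=1)
```

Because the centroid is a weighted average rather than an argmax, gradients flow from the coordinate loss back into the heat-map, which is what makes the end-to-end training below possible.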

Loss Function:

Since the coordinate regression layer outputs numerical coordinates, we can directly compute the Euclidean distance between the predicted coordinate Linf(x, y, z) and the ground truth Lgt(x, y, z). We take advantage of this fact to formulate the main term of the landmark localization loss function (Eq. 2).

$$ \mathcal{L}_{{{\text{euc}}}} (L_{{{\text{gt}}}} ,L_{{{\text{inf}}}} ) = \left\| {L_{{{\text{inf}}}} - L_{{{\text{gt}}}} } \right\|_{2} $$
(2)

The shape of the implicit heat-map also affects the regression accuracy of the landmark coordinates [16]. More specifically, to force the implicit heat-map to resemble a spherical Gaussian distribution, we minimize the divergence between the heat-map H(I; ω) and an appropriate target normal distribution \(N(L_{\inf } ,\sigma_{t}^{2} )\). Equation 3 defines this distribution regularization, where D(·||·) is the Jensen-Shannon divergence.

$$ \mathcal{L}_{{{\text{reg}}}} (H(I;\omega ),L_{{{\text{inf}}}} ,\sigma_{t} ) = D(H(I;\omega )||N(L_{\inf } ,\sigma_{t}^{2} )) $$
(3)

Equation 4 shows how the regularization is incorporated into the Euclidean distance loss. A regularization coefficient λ sets the strength of the regularizer \(\mathcal{L}_{{{\text{reg}}}}\).

$$ \mathcal{L}_{{{\text{lmk}}}} = \mathcal{L}_{{{\text{euc}}}} (L_{{{\text{gt}}}} ,L_{{{\text{inf}}}} ) + \lambda \cdot \mathcal{L}_{{{\text{reg}}}} (H(I;\omega ),L_{{{\text{inf}}}} ,\sigma_{t} ) $$
(4)
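A hedged PyTorch sketch of the combined loss of Eq. 4, assuming flattened heat-maps and a spherical Gaussian target constructed on the voxel grid (all function names and the numerical-stability epsilon are our assumptions):

```python
import torch

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between batches of discrete distributions.

    p, q: tensors of shape (B, N), each row summing to 1.
    """
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps).log() - (m + eps).log())).sum(dim=1)
    kl_qm = (q * ((q + eps).log() - (m + eps).log())).sum(dim=1)
    return 0.5 * (kl_pm + kl_qm)

def gaussian_target(coords, shape, sigma):
    """Spherical Gaussian N(coords, sigma^2) on the voxel grid, normalized.

    coords: (B, 3) landmark coordinates as (x, y, z); shape: (D, H, W).
    Returns flattened distributions of shape (B, D*H*W).
    """
    d, h, w = shape
    dev, dt = coords.device, coords.dtype
    zz, yy, xx = torch.meshgrid(
        torch.arange(d, device=dev, dtype=dt),
        torch.arange(h, device=dev, dtype=dt),
        torch.arange(w, device=dev, dtype=dt), indexing="ij")
    x0, y0, z0 = (coords[:, i].view(-1, 1, 1, 1) for i in range(3))
    dist2 = (xx - x0) ** 2 + (yy - y0) ** 2 + (zz - z0) ** 2
    g = torch.exp(-dist2 / (2 * sigma ** 2)).flatten(1)
    return g / g.sum(dim=1, keepdim=True)

def landmark_loss(heatmap, coords_pred, coords_gt, sigma_t=5.0, lam=1.0):
    """Eq. 4: Euclidean distance term plus lambda-weighted JS regularizer."""
    l_euc = (coords_pred - coords_gt).norm(dim=1).mean()
    p = heatmap.flatten(1)
    p = p / p.sum(dim=1, keepdim=True)
    q = gaussian_target(coords_pred, heatmap.shape[1:], sigma_t)
    l_reg = js_divergence(p, q).mean()
    return l_euc + lam * l_reg
```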

2.3 Rigid Registration Network

The architecture of the spatial transform regression network used for rigid registration is shown in Fig. 4. The input is the concatenated patch pair (PCBCT || PMRI) centered at the landmarks, and the output is the transform Tθ, which encodes the spatial relationship between the two image patches. Each convolution layer is zero-padded and followed by a ReLU activation. After two max-pooling operations, two fully connected (FC) layers with ReLU activations gather information from the entire patches to produce the rigid transform parameters M = [θx, θy, θz, Δx, Δy, Δz].

Fig. 4. Architecture of the multi-modality image rigid transformation matrix regression network.
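As an illustration, a minimal PyTorch sketch of such a regression network; the channel widths, kernel sizes, FC width, and patch size are assumptions, since the paper specifies only zero-padded convolutions with ReLUs, two max poolings, and two FC layers:

```python
import torch
import torch.nn as nn

class RigidParamNet(nn.Module):
    """Sketch of the transform regression CNN (layer sizes assumed).

    Takes a concatenated patch pair (2-channel 3D volume) and outputs the
    six rigid parameters M = [theta_x, theta_y, theta_z, dx, dy, dz].
    """
    def __init__(self, patch_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(2, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                      # first max pooling
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                      # second max pooling
        )
        s = patch_size // 4                       # spatial size after pooling
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * s ** 3, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 6),                    # [θx, θy, θz, Δx, Δy, Δz]
        )

    def forward(self, p_cbct, p_mri):
        x = torch.cat([p_cbct, p_mri], dim=1)     # channel-wise concatenation
        return self.head(self.features(x))
```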

Transformation Matrix Mapping:

This layer converts the parameter vector M into an exact rigid transformation matrix, instead of the affine matrix used in [10, 11, 13]. The shearing introduced by an affine transformation is thereby eliminated, improving the accuracy of rigid registration. To keep the entire registration network trainable by back-propagation, this parameter mapping must be differentiable. For rotation about the x-axis, the rotation matrix Rx is

$$ R_{x} = \cos (\theta_{x} ) \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + \sin (\theta_{x} ) \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos (\theta_{x} ) & -\sin (\theta_{x} ) & 0 \\ 0 & \sin (\theta_{x} ) & \cos (\theta_{x} ) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} $$

The rotation matrices Ry and Rz, defined analogously to Rx, determine the amount of rotation about each individual axis of the coordinate system. The translation matrix is

$$ D = \Delta x \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + \Delta y \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + \Delta z \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \Delta x \\ 0 & 1 & 0 & \Delta y \\ 0 & 0 & 1 & \Delta z \\ 0 & 0 & 0 & 1 \end{bmatrix} $$

Multiplying these rotation and translation matrices in a fixed order, with the translation applied after the rotations, yields the rigid transformation matrix Tθ,

$$ T_{\theta } = D*R_{x} *R_{y} *R_{z} $$
(5)
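A minimal PyTorch sketch of this differentiable parameter-to-matrix mapping; building the matrices by stacking tensor elements (rather than writing into constant arrays) keeps every operation differentiable, so gradients flow back to the six regressed parameters:

```python
import torch

def _rx(t, o, z):
    c, s = torch.cos(t), torch.sin(t)
    return torch.stack([torch.stack([o, z, z, z]),
                        torch.stack([z, c, -s, z]),
                        torch.stack([z, s, c, z]),
                        torch.stack([z, z, z, o])])

def _ry(t, o, z):
    c, s = torch.cos(t), torch.sin(t)
    return torch.stack([torch.stack([c, z, s, z]),
                        torch.stack([z, o, z, z]),
                        torch.stack([-s, z, c, z]),
                        torch.stack([z, z, z, o])])

def _rz(t, o, z):
    c, s = torch.cos(t), torch.sin(t)
    return torch.stack([torch.stack([c, -s, z, z]),
                        torch.stack([s, c, z, z]),
                        torch.stack([z, z, o, z]),
                        torch.stack([z, z, z, o])])

def rigid_matrix(m):
    """Map M = [θx, θy, θz, Δx, Δy, Δz] (tensor of shape (6,)) to a 4x4
    rigid transform, Eq. 5; translation is applied after the rotations."""
    o, z = torch.ones_like(m[0]), torch.zeros_like(m[0])
    d = torch.stack([torch.stack([o, z, z, m[3]]),
                     torch.stack([z, o, z, m[4]]),
                     torch.stack([z, z, o, m[5]]),
                     torch.stack([z, z, z, o])])
    return d @ _rx(m[0], o, z) @ _ry(m[1], o, z) @ _rz(m[2], o, z)
```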

Once the transformation matrix Tθ is obtained, a spatial transformation layer [14] is used to warp the moving image according to Tθ. Each voxel of the warped image \(\tilde{P}_{{{\text{CBCT}}}}\) is computed by tri-linear interpolation at the corresponding location, given by the displacement vector, in the subject image PCBCT:

$$ \tilde{P}_{{{\text{CBCT}}}} (p) = \sum\limits_{q \in N(p + \omega )} {P_{{{\text{CBCT}}}} (q)\prod\limits_{d \in \{ x,y,z\} } {\left( {1 - \left| {p_{d} + \omega_{d} - q_{d} } \right|} \right)} } $$
(6)

where p and q are voxel coordinates, ω is the displacement of p, N(p + ω) is the set of eight neighboring voxels of p + ω on the image grid, and d iterates over the three spatial dimensions.
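In practice this warping can be realized with PyTorch's spatial transformer primitives; a sketch assuming the rigid matrix has already been converted to the normalized coordinates expected by affine_grid (with 5D inputs, mode "bilinear" performs tri-linear sampling):

```python
import torch
import torch.nn.functional as F

def warp_rigid(moving, t_theta):
    """Warp a moving volume with a batch of 4x4 rigid matrices.

    moving: (B, 1, D, H, W); t_theta: (B, 4, 4) in normalized coordinates.
    """
    theta = t_theta[:, :3, :]                        # (B, 3, 4) for affine_grid
    grid = F.affine_grid(theta, moving.shape, align_corners=False)
    return F.grid_sample(moving, grid, mode="bilinear", align_corners=False)
```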

Loss Function:

Mutual information (MI) has become a common loss function for image registration, especially in the multi-modality setting [17]. Formally, the mutual information between the image patches PMRI and \(\tilde{P}_{{{\text{CBCT}}}}\) is defined as follows:

$$ {\text{MI}}(P_{{{\text{MRI}}}} ,\tilde{P}_{{{\text{CBCT}}}} ) = \sum\limits_{x} {\sum\limits_{y} {p_{{\text{MRI,CBCT}}} (x,y)\log \frac{{p_{{\text{MRI,CBCT}}} (x,y)}}{{p_{{{\text{MRI}}}} (x)p_{{{\text{CBCT}}}} (y)}}} } $$
(7)

To enable end-to-end training of the registration network, we use Parzen windowing [18] to compute a differentiable MI for the loss function:

$$ \mathcal{L}_{{{\text{sim}}}} (P_{{{\text{MRI}}}} ,\tilde{P}_{{{\text{CBCT}}}} ) = - {\text{MI}}(P_{{{\text{MRI}}}} ,\tilde{P}_{{{\text{CBCT}}}} ) $$
(8)
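A hedged sketch of a differentiable MI loss using a Gaussian Parzen window (soft histogram); the bin count and window width below are illustrative choices, not values from the paper:

```python
import torch

def mi_loss(fixed, warped, num_bins=32, sigma=0.02):
    """Negative mutual information (Eq. 8) via a Gaussian Parzen window.

    fixed, warped: tensors of shape (B, N), intensities scaled to [0, 1].
    """
    eps = 1e-10
    centers = torch.linspace(0.0, 1.0, num_bins,
                             device=fixed.device, dtype=fixed.dtype)
    # Soft (differentiable) assignment of every voxel to every histogram bin.
    wf = torch.exp(-(fixed.unsqueeze(-1) - centers) ** 2 / (2 * sigma ** 2))
    wf = wf / (wf.sum(dim=-1, keepdim=True) + eps)           # (B, N, K)
    wm = torch.exp(-(warped.unsqueeze(-1) - centers) ** 2 / (2 * sigma ** 2))
    wm = wm / (wm.sum(dim=-1, keepdim=True) + eps)

    n = fixed.shape[1]
    p_joint = torch.bmm(wf.transpose(1, 2), wm) / n          # (B, K, K)
    p_f = p_joint.sum(dim=2, keepdim=True)                   # marginal p(x)
    p_m = p_joint.sum(dim=1, keepdim=True)                   # marginal p(y)
    mi = (p_joint * torch.log(p_joint / (p_f * p_m + eps) + eps)).sum(dim=(1, 2))
    return -mi.mean()                                        # loss = -MI
```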

The optimal rigid transformation matrix \(T_{\theta }^{*}\) is obtained through network training. The coordinate offset between the landmarks LMRI(x, y, z) and LCBCT(x, y, z) is then mapped into \(T_{\theta }^{*}\), yielding the spatial transformation matrix between IMRI and ICBCT.

3 Experiments

3.1 Data

The TMJ dataset consists of 204 images (paired CBCT and MRI scans) from 102 patients at Peking University School and Hospital of Stomatology. CBCT images are of size 481 × 481 × 481, with a voxel size of 0.125 × 0.125 × 0.125 mm3. MRI images are 256 × 256 × (7–11), with a voxel size of 0.546875 × 0.546875 × 3.3 mm3. We group the image intensities by modality and perform histogram matching within each modality.
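For instance, histogram matching within a modality can be performed with SimpleITK's HistogramMatchingImageFilter; the file names and filter parameters below are illustrative, not from the paper:

```python
import SimpleITK as sitk

# Match the intensity histogram of one scan to a reference scan of the
# same modality (hypothetical file names).
moving = sitk.ReadImage("mri_case02.nii.gz", sitk.sitkFloat32)
reference = sitk.ReadImage("mri_case01.nii.gz", sitk.sitkFloat32)

matcher = sitk.HistogramMatchingImageFilter()
matcher.SetNumberOfHistogramLevels(256)
matcher.SetNumberOfMatchPoints(16)
matcher.ThresholdAtMeanIntensityOn()  # exclude low-intensity background
matched = matcher.Execute(moving, reference)
sitk.WriteImage(matched, "mri_case02_matched.nii.gz")
```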

3.2 Training Details

Network Training.

Localization networks were trained with a mini-batch size of 1. The implicit heat-map has the same size as the resampled input image, with σ = 5. In our experiments, we picked λ = 1 by cross-validation. The models were optimized with RMSProp using an initial learning rate of 2.5 × 10−4. Each model was trained for 120 epochs, with the learning rate reduced by a factor of 10 at epochs 60 and 90 (an epoch is one complete pass over the training set). To increase the registration network's ability to capture displacements of the input image patches, a random offset in [−60, +60] voxels is added to the landmark coordinates along the x, y, and z directions to determine the center position of the image patches. For the registration network, we set the learning rate to 1.0 × 10−2 and decay it exponentially (ExponentialLR) with decay coefficient gamma = 0.95. We implemented our method in PyTorch and used a workstation equipped with a single Maxwell-architecture NVIDIA Titan X GPU.
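A minimal sketch of the reported learning-rate schedule; the optimizer below is an assumption, as the paper specifies only the initial rate of 1.0 × 10−2 and ExponentialLR with gamma = 0.95:

```python
import torch

model = torch.nn.Linear(6, 6)  # stand-in for the registration network
optimizer = torch.optim.Adam(model.parameters(), lr=1.0e-2)  # optimizer assumed
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(10):
    # ... per-batch forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()  # lr <- lr * 0.95 once per epoch

print(optimizer.param_groups[0]["lr"])  # 1.0e-2 * 0.95**10
```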

Landmark Localization Results.

The mean radial error (MRE, in mm) is the commonly used evaluation index for medical image landmark detection [15]. Compared with the heat-map based method, the MRE of our network is markedly lower and largely modality-independent (Fig. 5).

Fig. 5. Distance error of landmark localization. MRI (left): 6.6803 ± 7.0876 mm (heat-map based method) vs. 2.0244 ± 1.0635 mm (our method); CBCT (right): 7.6375 ± 10.0229 mm (heat-map based method) vs. 2.6371 ± 1.2982 mm (our method).

Rigid Registration Results.

To evaluate the performance of our proposed unsupervised rigid registration network, we compare it against SimpleElastix and ANTs, two software packages based on traditional iterative optimization, and a CNN-based affine transformation matrix regression following [10].

Fig. 6. Comparison of image patch registration results of various methods. (a) Initial image, (b) ANTs-Affine, (c) SimpleElastix, (d) Learning Affine, (e) Our method.

Figure 6 shows the registration results of multiple methods on the same image patches. Each panel has two layers: the bottom layer is the MRI image and the top layer is the CBCT image with a color overlay. The first column shows the whole TMJ at the center of the image patch. The two initial images exhibit obvious position and angle misalignment (Fig. 6a). The ANTs package cannot complete the registration task effectively (Fig. 6b), whereas SimpleElastix obtains good alignment after iterative optimization (Fig. 6c). The shearing component of the learned affine registration tilts the spatial relationship (Fig. 6d). Mutual information and structural similarity (SSIM) [18] between the registered patches are the most commonly used indices to measure alignment. Table 1 gives the quantitative evaluation results of the registration methods.

Table 1. Comparison of MI and SSIM for the registration results of different methods.

Combining the landmark localization and unsupervised registration stages, the registration result for the whole MRI-CBCT images is shown in Fig. 7. In contrast, neither the SimpleElastix nor the ANTs package can register the whole images successfully.

Fig. 7. Rigid registration result for the whole TMJ MRI-CBCT images. The center square area shows the superposition of MRI and CBCT.

4 Conclusion

We proposed a landmark-guided rigid registration network for the common clinical problem of multi-modality medical image registration. The end-to-end landmark localization network effectively overcomes the field-of-view differences between modality images, and the rigid transformation regression improves registration accuracy and speed. We conclude that our method can be applied effectively to similar image registration problems.