
1 Introduction

Thousands of people around the world suffer from cochlear damage that affects their hearing. Cochlear implant (CI) surgery is currently an effective treatment for deafness and severe hearing loss [1]. Before CI surgery, doctors extract the relevant information from medical images of the cochlea in a manual procedure that costs time and effort, and during surgery they work at the limits of their visual-tactile feedback and dexterity. Robotic cochlear implantation can reduce the instability of manual operations in traditional surgery [2]. Automating this manual procedure, however, is a challenging problem because of the small size and complicated structure of the cochlea. Combining cochlea images from different modalities using image registration and fusion techniques may help automate the identification of cochlear structures, provide more accurate measurements of the cochlea, and support the planning and execution of the drilling path for CI surgery.

Preoperative imaging is performed before CI surgery: a CT scan of the patient is acquired to identify the ear anatomy and segment the facial nerve [3], and an optimally safe drilling trajectory is computed based on the identified structures. During surgery, a cone-beam computed tomography (CBCT) scan of the patient is acquired to verify patient positioning and visualize the anatomy during treatment; CBCT is a common on-treatment imaging method owing to its fast acquisition, cost-effectiveness, and low dose to the patient. Registration between the planning CT and the intra-operative CBCT is crucial to map the drilling path and cochlear structures between the two time points.

In general, existing medical image registration methods are mainly classified as feature-based, intensity-based, segmentation-based, or fluoroscopy-based [4]. Segmentation-based registration needs a region of interest to be defined, and fluoroscopy-based registration is used for 2D-3D registration, so we mainly discuss the first two kinds. Feature-based registration has been used for CT-CBCT and other multimodal registrations [5,6,7]. It is computationally efficient, but the quality of the registration largely depends on the accuracy of feature extraction and matching; manual participation is usually required, and even when it is not, precisely matching corresponding points remains a problem to be solved. Intensity-based registration operates directly on voxel values and therefore needs no manual intervention. It aligns the two images by maximizing a similarity measure between them, such as the sum of squared differences (SSD) for monomodal registration, or mutual information (MI) or the correlation coefficient (CC) for multimodal registration. CBCT intensities are inconsistent with CT intensities due to artifacts from various sources such as scatter and truncation, so even though CT and CBCT share the same imaging modality, X-ray, CT-CBCT registration can be regarded as a special case of multimodal registration [8], where MI or CC is widely used. Although intensity-based registration is widely used for its simplicity and ease of operation, it is limited by time and computing resources. Fortunately, with a hierarchical scheme (a pyramid structure on the images, e.g., the method in [9]), computing time and memory can be greatly reduced. Furthermore, a large number of software packages for intensity-based medical image registration are available, such as ITK, 3D Slicer, Elastix, and other commercial software; most of them are based on C++ or shell scripts to improve performance and speed. In our study, we used Elastix for intensity-based CT-CBCT registration.

How to measure the registration quality is an even more critical issue, because without an accurate measure, no matter how advanced the algorithm is, it has no practical value. The most common form of performance evaluation is a similarity measure between the whole images or outlined structures, which quantifies how similar two or more images are. Among the various similarity measures, the root mean square error (RMSE) is the simplest; others include the structural similarity index (SSIM) and the Dice similarity coefficient (DSC). However, studies show that these measures, even when used in combination, cannot distinguish accurate from inaccurate registrations [10]. In addition, these measures often have no geometric significance. A more reliable measure is the target registration error (TRE). It evaluates registration accuracy from point correspondences, by computing the Euclidean distance between corresponding points. It has more physical meaning, but how to choose and match these points is a problem. In our study, we combine similarity measures and TRE to evaluate the registration error: the similarity measures are used to assess the overall registration quality, and the TRE compensates for their limitations and provides a more intuitive judgment. We discuss the methods in detail in the following sections.

2 Methods

In this section, we present a detailed description of the process we use to perform automatic registration of pre-operative CT images and intra-operative CBCT images. Most importantly, we propose specific evaluation metrics for the registration accuracy requirements that must be attended to in the actual surgery.

2.1 CT-CBCT Registration

Image Preprocessing.

The raw data have different voxel sizes and dimensions, and contain the patient bed, which may be a disruptive factor in the subsequent registration process. First, we resample the data to the same resolution. Then we remove the patient bed from the CT and CBCT images through intensity normalization, binarization, and morphological processing, as shown in Fig. 1.

Fig. 1. Removal of the patient bed from the image (one slice shown as an example): (a) the original image; (b) binary segmentation; (c) morphological processing (opening operation, hole filling); (d) extraction of the largest connected component to generate a mask; (e) the mask applied to (a)
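To make the bed-removal step concrete, the following is a minimal sketch in Python using SimpleITK and SciPy, assuming a simple global threshold and omitting the intensity-normalization step; the threshold value, the morphological settings, and the intensity calibration of the CT/CBCT data are illustrative assumptions, not the exact settings we used.

```python
import SimpleITK as sitk
import numpy as np
from scipy import ndimage

def remove_patient_bed(image, threshold=-300):
    """Mask out the patient bed by keeping the largest connected component.

    `threshold` is an illustrative intensity cut-off, not the value used in the paper.
    """
    vol = sitk.GetArrayFromImage(image)                     # (z, y, x) array
    binary = vol > threshold                                # binarization
    binary = ndimage.binary_opening(binary, iterations=2)   # morphological opening
    binary = ndimage.binary_fill_holes(binary)              # fill holes
    labels, n = ndimage.label(binary)                       # connected components
    if n == 0:
        return image
    sizes = ndimage.sum(binary, labels, range(1, n + 1))
    mask = labels == (np.argmax(sizes) + 1)                 # largest component = head
    masked = np.where(mask, vol, vol.min())                 # apply mask to the original
    out = sitk.GetImageFromArray(masked)
    out.CopyInformation(image)                              # keep spacing/origin/direction
    return out
```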

Intensity-Based Image Registration.

The algorithm and the components of intensity-based image registration used in our study are described in the flowchart in Fig. 2.

The preprocessed images act as input images. When the iterative process starts, the images are sampled into a hierarchical pyramid; otherwise the process is time-consuming for large images. Then, at each level of the pyramid, the images go through the registration process, which computes the cost function, e.g., the advanced Mattes mutual information (AMMI). The regular step gradient descent (RSGD) optimizer modifies the parameters of the affine transform to optimize the cost function. When the AMMI reaches its maximum or the maximum number of iterations is reached, the optimization ends and the transformation matrix is output. The transformation matrix is applied to the moving image to obtain the registered image.

We do this process using Elastix [11], which is an open-source intensity-based medical image registration software, based on the well-known Insight Segmentation and Registration Toolkit (ITK). The software allows the user to set various parameters to quickly configure, test, and compare different registration methods for a specific application. In previous studies, Elastix has been widely used for mono-modal or multi-modal, rigid or non-rigid registration [12,13,14,15], but rarely used for CT and CBCT registration.
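As a rough sketch of the pipeline in Fig. 2, the snippet below shows how an affine registration with the AMMI metric and RSGD optimizer could be configured through SimpleITK's Elastix interface (SimpleElastix). We actually drove Elastix through MATLAB wrappers (Sect. 3.2); the file names, the choice of CBCT as the fixed image, and the parameter values here are illustrative and do not reproduce Table 2.

```python
import SimpleITK as sitk

fixed = sitk.ReadImage("cbct_preprocessed.nii.gz")    # intra-operative CBCT (reference)
moving = sitk.ReadImage("ct_preprocessed.nii.gz")     # pre-operative CT

# Start from the default affine parameter map and override the main components.
pmap = sitk.GetDefaultParameterMap("affine")
pmap["Metric"] = ["AdvancedMattesMutualInformation"]   # AMMI cost function
pmap["Optimizer"] = ["RegularStepGradientDescent"]     # RSGD optimizer
pmap["NumberOfResolutions"] = ["4"]                    # pyramid levels (illustrative)
pmap["MaximumNumberOfIterations"] = ["500"]            # per-level cap (illustrative)

elastix = sitk.ElastixImageFilter()
elastix.SetFixedImage(fixed)
elastix.SetMovingImage(moving)
elastix.SetParameterMap(pmap)
elastix.Execute()

registered = elastix.GetResultImage()                  # CT resampled into CBCT space
transform = elastix.GetTransformParameterMap()         # affine transform parameters
sitk.WriteImage(registered, "ct_registered.nii.gz")
```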

Fig. 2. Intensity-based registration flowchart

2.2 Evaluation

For robotic cochlear implant surgery, registration speed and accuracy are the issues we focus on: a small error can damage the nerves around the cochlea, since only a margin of up to 1.0 mm is available for the facial nerve and an accuracy of at least 0.3 mm is required, depending on the navigation system [16]. In terms of time, measured from reading the DICOM image to generating the registered image or registration matrix, the process should take less than 2 min as a project metric. There is currently no convincing gold standard for measuring registration accuracy. As mentioned above, similarity measures cannot always distinguish accurate from inaccurate registrations, so they are used only as reference indicators. At present, the more reliable, intuitive, and widely used evaluation metric is the target registration error (TRE). In our study, in order to evaluate the registration results comprehensively, we use both similarity measures and TRE.

For TRE, we evaluate registration accuracy from two aspects. On the one hand, we segment a specific structure in the two images and calculate the distance between its centroids [17]. To ensure the accuracy of the segmentation, we use implanted titanium screws as the target structure, because in the images the brightness and contrast of the screws are much higher than those of the surrounding tissue. On the other hand, we determine corresponding points in the two images and calculate their average Euclidean distance. Points can be selected manually by experienced doctors; however, this is affected by human factors and cannot be accurate to a single voxel. We therefore adopt an automatic point-selection method based on the SIFT feature operator, and then filter out some of these landmarks based on manual experience. We discuss these two aspects in detail in the sections below.
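In both cases, once the points are paired in physical coordinates (screw centroids or matched SIFT keypoints), the TRE reduces to a mean Euclidean distance; a minimal sketch:

```python
import numpy as np

def target_registration_error(points_ref, points_reg):
    """Mean Euclidean distance between paired 3-D points (N x 3 arrays, in mm)."""
    points_ref = np.asarray(points_ref, dtype=float)
    points_reg = np.asarray(points_reg, dtype=float)
    return float(np.mean(np.linalg.norm(points_ref - points_reg, axis=1)))
```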

Similarity Metrics.

Similarity metrics include the root mean square error (RMSE), correlation coefficient (CC), normalized mutual information (NMI), and structural similarity index (SSIM). The formulas are given below. X and Y represent the two images, \({x}_{i}\) and \({y}_{i}\) are the gray values of the \(i\)th voxel, \({\mu }_{x}\) and \({\mu }_{y}\) are the mean gray values of the two images, \(p\) stands for the gray-value probability distribution, \({\sigma }_{x}^{2}\) and \({\sigma }_{y}^{2}\) are the variances, and \({\sigma }_{xy}\) is the cross-covariance. \({C}_{1}\) and \({C}_{2}\) are regularization constants for luminance and contrast, respectively.

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum {({x}_{i}-{y}_{i})}^{2}}$$
(1)
$$\mathrm{CC }= \frac{\sum ({x}_{i}-{\mu }_{x})({y}_{i}-{\mu }_{y})}{\sqrt{\sum {({x}_{i}-{\mu }_{x})}^{2}}\sqrt{\sum {({y}_{i}-{\mu }_{y})}^{2}}}$$
(2)
$$\mathrm{NMI}\left(\mathrm{X},\mathrm{Y}\right)=\frac{2\sum p\left({x}_{i},{y}_{i}\right)\mathrm{log}\frac{p\left({x}_{i},{y}_{i}\right)}{p\left({x}_{i}\right)p\left({y}_{i}\right)}}{-\sum \left[p\left({x}_{i}\right)\mathrm{log}\,p\left({x}_{i}\right)+p\left({y}_{i}\right)\mathrm{log}\,p\left({y}_{i}\right)\right]}$$
(3)
$$\mathrm{SSIM}\left(\mathrm{X},\mathrm{Y}\right)=\frac{(2{\mu }_{x}{\mu }_{y}+{C}_{1})({2\sigma }_{xy}+{C}_{2})}{({\mu }_{x}^{2}+{\mu }_{y}^{2}+{C}_{1})({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2})}$$
(4)
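A straightforward NumPy implementation of Eqs. (1)-(4) could look as follows, assuming the two volumes are already resampled to the same grid; the 64-bin histogram for NMI, the intensity rescaling, and the SSIM constants follow common defaults rather than values stated in this paper, and the SSIM is computed globally rather than in sliding windows.

```python
import numpy as np

def rmse(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.mean((x - y) ** 2))

def cc(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return np.sum(xm * ym) / (np.sqrt(np.sum(xm ** 2)) * np.sqrt(np.sum(ym ** 2)))

def nmi(x, y, bins=64):
    # Joint and marginal gray-value distributions estimated from a 2-D histogram.
    joint, _, _ = np.histogram2d(np.ravel(x), np.ravel(y), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))   # marginal entropies
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    return 2.0 * mi / (hx + hy)

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    # Global (whole-volume) SSIM on intensities rescaled to [0, 1];
    # c1 and c2 are the usual (0.01)^2 and (0.03)^2 defaults for unit range.
    def rescale(a):
        a = np.asarray(a, float)
        return (a - a.min()) / (a.max() - a.min() + 1e-8)
    x, y = rescale(x), rescale(y)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```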

Screws Centroid Position.

The TRE is defined as the mean Euclidean distance between the centroids of the eight corresponding screws implanted in the specimens. The local positions of the eight screws are shown in Fig. 3. They can be easily identified using threshold segmentation [18]. The centroid position in a volume image is then calculated using the equations below, where g(i, j, k) is the gray value at voxel (i, j, k).

$$\mathrm{x}=\frac{\sum g\left(i,j,k\right)*i}{\sum g\left(i,j,k\right)}$$
(5)
$$\mathrm{y}=\frac{\sum g\left(i,j,k\right)*j}{\sum g\left(i,j,k\right)}$$
(6)
$$\mathrm{z}=\frac{\sum g\left(i,j,k\right)*k}{\sum g\left(i,j,k\right)}$$
(7)
Fig. 3. Local positions of the eight screws
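A sketch of the screw segmentation and the intensity-weighted centroid of Eqs. (5)-(7) is given below; the threshold value and the assumption that the eight largest bright components are the screws are illustrative, and voxel indices are converted to millimeters with the image spacing so that centroid distances come out in mm. Pairing the eight centroids between the reference and registered volumes (e.g., by nearest neighbor) then yields the screw-based TRE via the mean-distance computation sketched earlier.

```python
import numpy as np
from scipy import ndimage

def screw_centroids(volume, spacing, threshold=2000, max_screws=8):
    """Intensity-weighted centroids (in mm) of bright screws, per Eqs. (5)-(7).

    `volume` is a (z, y, x) array, `spacing` the voxel size in mm in the same
    order; the threshold is illustrative, not the value used in the paper.
    """
    binary = volume > threshold                       # screws are far brighter than tissue
    labels, n = ndimage.label(binary)
    # Keep the largest components, assumed to be the implanted screws.
    sizes = ndimage.sum(binary, labels, range(1, n + 1))
    keep = np.argsort(sizes)[::-1][:max_screws] + 1
    centroids = []
    for lab in keep:
        idx = np.argwhere(labels == lab)              # voxel indices (i, j, k)
        w = volume[labels == lab]                     # gray values g(i, j, k)
        c = (idx * w[:, None]).sum(axis=0) / w.sum()  # weighted centroid in voxels
        centroids.append(c * np.asarray(spacing))     # convert to millimeters
    return np.array(centroids)
```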

Feature Points Extraction.

To compute the average distance between corresponding points in the reference and registered images, we need to extract feature points and match them. The scale-invariant feature transform (SIFT) is invariant to rotation, scaling, and brightness changes [19], so it is capable of extracting and matching stable, characteristic points between two images. SIFT-feature-based registration has been used in [7, 20, 21]. In our study, we use SIFT features not for registration but for evaluation.

SIFT feature extraction includes extremum detection in scale space, keypoint localization, orientation assignment, and generation of a feature vector called a "descriptor". The whole process is shown in Fig. 4 and Fig. 5. Scale space refers to the space formed by convolving a Gaussian function with the original image at different resolutions, called "octaves". The general principle of extremum detection is to find local extrema of the difference of Gaussians (DoG) in each octave, as shown in Fig. 4(b). The points corresponding to these extrema, found by comparing each voxel to its neighbors, are called keypoints (Fig. 4(c)). To match feature points between two images, the orientation of each keypoint must first be determined, i.e., the direction in which the gray value decreases fastest: the gradient directions and magnitudes of all voxels within a certain range centered at the keypoint are accumulated, and the angle with the highest magnitude is taken as the main orientation (Fig. 4(d)); to increase robustness, an auxiliary orientation is usually also assigned. The image patch is then rotated to the main orientation, gradient-orientation histograms with eight directions are computed in each sub-region, and the accumulated value of each gradient direction forms a seed point (Fig. 4(e)).

To evaluate registration accuracy, the most similar SIFT descriptors in the reference and registered images need to be identified (Fig. 5). For each descriptor, we compute the distances to its nearest and second-nearest neighbors in the other image. If the ratio between them is below a threshold [7], the feature with the lowest distance is accepted as the corresponding point and the resulting point-pair distance contributes to the TRE; otherwise, no association is made.
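The following 2-D OpenCV sketch illustrates this keypoint extraction and ratio-test matching on corresponding 8-bit slices; our evaluation operates on the volumes and includes a manual outlier check, so this is only an illustration of the idea, and the 0.7 ratio threshold is an assumed value.

```python
import cv2
import numpy as np

def matched_point_distances(ref_slice, reg_slice, ratio=0.7):
    """Match SIFT keypoints between two 8-bit grayscale slices (Lowe's ratio test)
    and return the Euclidean distances (in pixels) of the accepted pairs."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(ref_slice, None)
    kp2, des2 = sift.detectAndCompute(reg_slice, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)        # two nearest neighbors per descriptor
    distances = []
    for pair in knn:
        if len(pair) < 2:
            continue
        m, n = pair
        if m.distance < ratio * n.distance:        # accept only unambiguous matches
            p1 = np.array(kp1[m.queryIdx].pt)
            p2 = np.array(kp2[m.trainIdx].pt)
            distances.append(np.linalg.norm(p1 - p2))
    return np.array(distances)                     # mean(distances) gives a slice-wise error
```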

Fig. 4. (a) Original image; (b) DoG in each octave: each column stands for a resolution (from left to right, the resolution decreases) and each row stands for a Gaussian coefficient (from top to bottom, the coefficient increases and the image becomes blurrier); (c) keypoint detection; (d) orientation assignment; (e) descriptor generation (one descriptor is shown)

Fig. 5. The process of matching corresponding points

3 Experiments

3.1 Datasets

In this study, we conducted experiments on 16 pairs of pre-operative CT scans and intra-operative CBCT scans, of which 14 pairs were human data and 2 pairs were cadaveric data. Detailed data information is shown in Table 1. The institutional review board has approved this study.

Table 1. (a) Image-acquisition settings (CBCT). (b) Image-acquisition settings (CT)

3.2 Experimental Setup

Registration was performed using the Melastix toolbox, a collection of MATLAB wrappers for Elastix. The program runs in MATLAB R2021a on an Intel(R) Core(TM) i7-9750H CPU (2.60 GHz, 6 cores, 12 logical processors).

3.3 Registration Parameters

The main registration parameters we use based on Elastix are shown in Table 2.

Table 2. CT-CBCT registration parameters

4 Results

Some visual results of the registration are shown in Fig. 6, with slices from corpse_head1, corpse_head2, human4, and human10. From left to right, the columns are the reference image, the registered image, and their fusion display.

In the fusion images, gray regions have the same intensities, while magenta and green regions show where the intensities differ. Because CT and CBCT have different gray-value ranges, most of the area appears colored. The images are aligned, which is reflected in the overlap of the magenta and green regions. Among the datasets, corpse_head2 was deformed due to a long soaking time and incorrect placement. We used an affine transformation because in the actual operation the head does not deform greatly.
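Such a fusion display can be generated by putting one image in the green channel and the other in the red and blue channels, so that equal intensities blend to gray while mismatches appear green or magenta; a minimal sketch (the channel assignment here is our illustrative convention, similar to MATLAB's imfuse):

```python
import numpy as np

def green_magenta_fusion(ref_slice, reg_slice):
    """Fuse two slices: reference in green, registered in magenta (red + blue).

    Each slice is rescaled to [0, 1] independently; gray means equal
    intensities, magenta or green means a mismatch.
    """
    def rescale(img):
        img = img.astype(float)
        return (img - img.min()) / (img.max() - img.min() + 1e-8)

    ref, reg = rescale(ref_slice), rescale(reg_slice)
    rgb = np.stack([reg, ref, reg], axis=-1)   # R = registered, G = reference, B = registered
    return rgb
```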

Fig. 6. Slices of the registration results: the left column is the reference image, the middle column is the registered image, and the right column is the fusion display, where magenta and green regions show differing intensities. (a) corpse_head1; (b) corpse_head2; (c) human4; (d) human10. (Color figure online)

The centroid distances of the eight screws implanted in the two cadaver heads are shown in Table 3, and Table 4 shows the registration time, the average distance of the corresponding feature points (AveDis), and the similarity metrics for the two cadaver heads. Results for the human datasets are shown in Table 5.

Table 3. Centroid distance of the eight screws (mm)
Table 4. Comparison of registration time, average distance and similarity metrics (corpse_head1–2)
Table 5. Comparison of registration time, average distance and similarity metrics (human1–14)

To sum up, first of all, the registration speed depends largely on the hardware. In our experiments, the entire registration process was completed within one minute for all 16 sets of data. Secondly, in terms of accuracy, we calculated the similarity metrics, the centroid distance of the implanted screws, and the average distance of the feature points. For the similarity metrics, the larger the value, the higher the grayscale similarity of the two images. We found that the values for the 14 human datasets are significantly higher than those for the 2 corpse head datasets. One of the corpse heads was deformed due to a long soaking time, and the other had a limited field of view during scanning, so part of the voxel information was missing. These may be the reasons for the low similarity metrics.

For the corpse data, Table 3 and Table 4 show that a good alignment is achieved during the registration process. The average centroid distances of the implanted screws are 0.19 mm and 0.12 mm, respectively, which basically meets the requirements of surgical precision. For the extraction of corresponding feature points, keypoints are automatically extracted and matched based on SIFT. After the corresponding points are obtained, a further selection is carried out based on manual experience to remove obviously non-corresponding points, ensuring that the final results are not affected by individual outliers. The results show that the TRE obtained from the implanted titanium screws is very close to that obtained from the SIFT feature points. Therefore, SIFT feature extraction can replace the pre-operative implantation of titanium screws for evaluating registration results, which greatly simplifies the surgical procedure and avoids unnecessary injury.

For the human data, the average feature-point distances of the 14 datasets are less than one voxel. Furthermore, as can be seen from Table 5, although the similarity metrics appear high, the TRE results do not coincide with them. Therefore, we cannot estimate registration quality from the similarity alone, because it reflects differences in the overall gray values but not differences in image structure.

5 Conclusion

Cochlear implant surgery requires the registration of pre-operative CT and intra-operative CBCT images to map the preoperatively computed drilling trajectory into the intra-operative space. For robotic surgery, registration speed and precision are especially important. In this paper, we use Elastix to perform intensity-based image registration, which completes the entire process within one minute. In terms of accuracy, the similarity metrics cannot reflect geometric differences and are easily affected by gray values. The target registration errors of the two corpse head datasets are both below 0.3 mm, whether measured by screw centroid distance or by feature-point distance. We also find that the results from the implanted titanium screws and the SIFT feature points are very close, so SIFT feature extraction can replace the pre-operative implantation of titanium screws for evaluating registration results, which greatly simplifies the surgical procedure and avoids unnecessary injury. For the 14 human datasets, the similarity metrics are relatively high and the average point distance is less than one voxel, which indicates a reasonable registration result. In clinical surgery, when high registration accuracy is required, correspondingly high-resolution images should be acquired.