Abstract
High-definition video is becoming a standard in clinical endoscopy. State-of-the-art systems for medical endoscopy provide 1080p video streams at 60 Hz. For such high resolutions and frame rates, the real-time execution of image-processing tasks is far from trivial, requiring careful algorithm design and development. In this article, we propose a fully functional software-based solution for correcting the radial distortion (RD) of HD video that runs in real time on a personal computer (PC) equipped with a conventional graphics processing unit (GPU) and a video acquisition card. Our system acquires the video feed directly from the digital output of the endoscopic camera control unit, warps each frame using a heterogeneous parallel computing architecture, and outputs the result back to the display. Although we target the particular problem of correcting geometric distortion in medical endoscopy, the concepts and framework described herein can be extended to other image-processing tasks with hard real-time requirements. We show that a heterogeneous approach, combined with efficient memory access patterns on the GPU, improves the performance of this highly memory-bound algorithm, leading to frame rates above 250 fps.
1 Introduction
Endoscopic images are usually acquired using a flexible or rigid lens coupled to a CCD sensor. The lens is introduced into the patient’s body through a small port for visualizing the anatomical cavities during surgery and diagnosis. In this type of procedure, the endoscopic video is the only guidance for the medical practitioner and therefore the system has to provide the best imaging quality possible [1]. The latest endoscopic systems use 1080p video (1,920 × 1,080 pixels/frame) at 60 Hz as the standard for visualization. The HD resolution and the high frame rate allow an enhanced visualization of the structures and therefore are likely to significantly improve the surgeon’s perception [2].
Due to the small size of the lens, endoscopic images present strong RD (also known as barrel distortion). RD is a nonlinear geometric deformation of the image that moves the points radially toward the center and can severely affect the notion of depth [3]. Several authors have addressed the problem of RD correction in wide angle lenses [4–7], but these either are not suited for real-time HD processing [4, 5] or do not fully model the endoscopic camera [6, 7].
Advances in very large-scale integration (VLSI) systems using the distortion model estimation proposed in [4] showed promising results in the correction of RD using dedicated hardware. Asari presented in [8] an efficient VLSI architecture to correct the RD in wide-angle camera images by mapping the algorithmic steps onto a linear array. Later, in [9], a pipelined architecture was presented that was able to process images at a rate of 30 Mpixels/s. In [10], the authors proposed a VLSI implementation for RD correction that reduced the number of cells by 61 % compared to [9] and achieved a throughput of 40 Mpixels/s. The recent work presented in [11] reduced hardware cost by at least 69 % and memory requirements by 75 % compared to previous works. In [12], the authors presented a comparison of RD correction implementations on a homogeneous multi-core processor, a heterogeneous cell broadband engine, and an FPGA. They concluded that only an FPGA and a fully optimized version of the code running on the cell processor could provide real-time processing speed (30 fps for input images of 2,592 × 1,944, which translates into a throughput of 150 Mpixels/s).
While previous software-based implementations fail to process large amounts of data in real time or do not fully model the endoscopic camera, hardware-based solutions lack the versatility to adapt to different devices or lenses (and therefore changes in the projection model in real time) and involve additional costs and effort to implement.
In this work, we propose a system for acquiring and processing the HD video feed from an endoscope in real time using a conventional PC equipped with an acquisition board and a GPU. Our solution is based on the work of [13] that updates the endoscopic camera projection model [6], according to the possible lens rotation at each frame time instant. Our system acts like a plug-and-play module that captures the video feed, processes each frame on a regular PC, and then outputs the result back into the existing visualization system (see Fig. 1). We verify that a homogeneous multi-core CPU is not capable of supporting HD real-time video distortion correction, as observed in [12], and the GPU-based implementation of [13] also fails to deliver the necessary frame rates for the latest endoscopic devices. Our framework for correcting the radial distortion of a HD video stream is based on a heterogeneous implementation that uses both CPU and GPU concurrently. We demonstrate that a hybrid solution, where the computational workload is distributed across the CPU and the GPU in parallel, enables the processing of the video feed (1,920 × 1,080 pixels/frame) at frame rates up to 250 fps (500 Mpixels/s throughput) when implementing efficient memory access patterns on the GPU side of the heterogeneous parallel system.
2 Radial distortion correction in clinical endoscopy
This article is closely related to the work presented in [13]. While [13] describes the camera projection model, the calibration of the endoscope in the OR, and the estimation of the relative rotation of the lens scope, the current article addresses in detail the problem of efficient implementation for real-time execution in HD endoscopic devices.
In [13], the endoscopic camera is calibrated from a single image of a chessboard pattern [14], with the radial distortion described by the so-called division model [15], which uses a single parameter ξ to quantify the amount of image deformation. Let \(\mathbf{X}\) be the 3D coordinates of a point in the world reference frame. The corresponding point \(\mathbf{x}^{\prime}\) in the image plane is determined by the projection Eq. 1:

$$\mathbf{x}^{\prime} \sim \mathsf{K}\,{\varvec{\Upgamma}}_{\xi}(\mathsf{P}\mathbf{X})$$
where ∼ denotes an equality up to a scale factor, \(\mathsf{K}\) is the well-known intrinsics matrix obtained by the camera calibration and \(\mathsf{P}\) denotes the standard 3 × 4 projection matrix [3]. \({\varvec{{\Upgamma}}}_{\xi}\) is the radial distortion non-linear function that maps world undistorted points \(\mathbf{x}_{\rm u} \sim (x_{\rm u}\ y_{\rm u}\ z_{\rm u})^{\rm T}\) into the corresponding world distorted point:
Assuming that the 3D point \(\mathbf{X}\) is represented in the camera reference frame, \(\mathsf{P} \sim (\mathsf{I}_{3 \times 3} \ 0_{3 \times 1})\), and therefore we can compute the distorted image coordinates of an undistorted image point \(\mathbf{x}_{\rm u}^{\prime}\) (in pixels) by Eq. 3:

$$\mathbf{x}^{\prime} \sim \mathsf{K}\,{\varvec{\Upgamma}}_{\xi}(\mathsf{K}_y^{-1}\mathbf{x}_{\rm u}^{\prime})$$

where \(\mathsf{K}_y^{-1}\) maps the undistorted image point \(\mathbf{x}_{\rm u}^{\prime}\) into a canonical plane, specifying certain desired characteristics of the undistorted image (e.g., center, resolution) [13].
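To make the warping concrete: under the single-parameter division model, the undistorted radius relates to the distorted radius by \(r_{\rm u} = r_{\rm d}/(1 + \xi r_{\rm d}^2)\), so the inverse map needed for warping (undistorted to distorted) reduces to solving the quadratic \(\xi r_{\rm u} r_{\rm d}^2 - r_{\rm d} + r_{\rm u} = 0\). The following is a minimal C++ sketch under this assumption; function and type names are ours, not the authors' implementation.

```cpp
#include <cassert>
#include <cmath>

// Sketch of the single-parameter division model (xi < 0 for barrel
// distortion). The distorted radius r_d satisfies
//   r_u = r_d / (1 + xi * r_d^2),
// so warping requires the inverse map r_u -> r_d, obtained by solving
// the quadratic xi*r_u*r_d^2 - r_d + r_u = 0 (root that tends to r_u
// as xi -> 0).

struct Point2 { double x, y; };

// Map an undistorted (canonical-plane) point to its distorted location.
Point2 distort(Point2 pu, double xi) {
    double ru = std::hypot(pu.x, pu.y);
    if (ru < 1e-12 || xi == 0.0) return pu;   // center / no distortion
    double disc = 1.0 - 4.0 * xi * ru * ru;
    assert(disc >= 0.0);                      // always holds for xi < 0
    double rd = (1.0 - std::sqrt(disc)) / (2.0 * xi * ru);
    double s = rd / ru;                       // radial scale factor
    return { pu.x * s, pu.y * s };
}
```

A quick sanity check of the round trip: feeding the returned distorted radius back through \(r_{\rm d}/(1+\xi r_{\rm d}^2)\) must recover the original undistorted radius.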
The camera calibration changes during operation because the doctor rotates the lens scope with respect to the CCD head. As discussed in [13], the problem can be solved by considering an adaptive projection model that takes this relative rotation into account. The authors devised an efficient algorithm for extracting the image boundary and detecting the lens mark (Fig. 2) that relies on the extraction of boundary contour points on the GPU, as well as standard methods implemented on the CPU (such as RANSAC [16] and Kalman filtering [17]), to deliver a robust estimation of the rotation parameters. With this new adaptive projection model, the intrinsics matrix is updated by a rotation around the lens rotation center \(\mathbf{q}\) and the distortion mapping of Eq. 3 becomes:

$$\mathbf{x}_i^{\prime} \sim \mathsf{R}_{-\alpha_i,\mathbf{q}_i^{\prime\prime}}\,\mathsf{K}_i\,{\varvec{\Upgamma}}_{\xi}(\mathsf{K}_y^{-1}\mathbf{x}_{\rm u}^{\prime})$$
where \(\mathsf{K}_i \sim \mathsf{R}_{\alpha_i,\mathbf{q}_i} \mathsf{K}\) is the intrinsics matrix updated according to the lens rotation α at time i and \(\mathsf{R}_{-\alpha_i,\mathbf{q}_i^{\prime\prime}}\) is a rotation matrix that rotates the warping result back to the original orientation.
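The intrinsics update \(\mathsf{K}_i \sim \mathsf{R}_{\alpha_i,\mathbf{q}_i} \mathsf{K}\) amounts to composing \(\mathsf{K}\) with a homogeneous 2D rotation about the lens rotation center. A small C++ sketch (helper names are ours, assuming \(\mathsf{R}_{\alpha,\mathbf{q}} = \mathsf{T}(\mathbf{q})\,\mathsf{R}(\alpha)\,\mathsf{T}(-\mathbf{q})\)):

```cpp
#include <array>
#include <cmath>

// Sketch: update the intrinsics K by a rotation of angle alpha around the
// lens rotation center q, i.e. K_i = R(alpha, q) * K, with R(alpha, q)
// the 3x3 homogeneous transform T(q) * R(alpha) * T(-q).

using Mat3 = std::array<std::array<double, 3>, 3>;

Mat3 matmul(const Mat3& A, const Mat3& B) {
    Mat3 C{};                               // zero-initialized accumulator
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Homogeneous 2D rotation by alpha (radians) around point q = (qx, qy);
// note that q itself is a fixed point of this transform.
Mat3 rotationAbout(double alpha, double qx, double qy) {
    double c = std::cos(alpha), s = std::sin(alpha);
    return {{{c, -s, qx - c * qx + s * qy},
             {s,  c, qy - s * qx - c * qy},
             {0,  0, 1}}};
}

Mat3 updateIntrinsics(const Mat3& K, double alpha, double qx, double qy) {
    return matmul(rotationAbout(alpha, qx, qy), K);
}
```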
Figures 2 and 3 show the results of the RD correction in different environments using the mapping function in Eq. 3, where the effects of the distortion are easily noticed.
3 Proposed system
GPUs have emerged as powerful processors in the last few years. The recent introduction of the CUDA interface [18] has enabled the scientific community to parallelize computationally intensive algorithms and to achieve faster execution times [19]. The Nvidia GF100 and subsequent architectures (also known as Fermi) introduced significant improvements in memory accesses and a drastic increase in compute capability when compared with the previous G80 and GT200 families. In Sect. 4, we perform experiments using both older architectures (G80 and GT200) and the more recent Fermi family.
The proposed system consists of a regular workstation equipped with a GPU and an HD acquisition board. The HD video feed is captured through the acquisition board and each image is transferred to the GPU, which performs part of the processing. At the end, the resulting corrected image is displayed on the visualization system through the OpenGL buffer. Figure 4 illustrates the processing steps for each frame of the video feed.
The system runs in a heterogeneous environment with one POSIX thread [20] handling the GPU device calls and the other performing the serialized CPU processing necessary to extract the meaningful region boundary of the image mentioned in Sect. 2.
The system is divided into four main processing blocks:
-
Colorspace conversion: After transferring the image into the GPU (in YUV422 format), a colorspace and grayscale conversion is performed. Each RGB channel plus the grayscale value are written to the global memory and later bound to the texture memory space of the GPU.
-
Boundary detection: Using the grayscale image and the previous boundary estimation parameters from the CPU thread, the boundary contour points are extracted using the procedure described in [13] and the result is passed to the CPU to compute the boundary for the next iteration.
-
RD correction: Using the R, G and B channel textures and the previous boundary estimation parameters, the RD is corrected on the GPU and the result is written to the OpenGL global memory buffer for visualization.
-
CPU thread: The CPU thread is responsible for robustly estimating the boundary contour from the set of contour points extracted on the GPU. This procedure involves RANSAC [16], low-pass filtering, and an EKF [17] to robustly fit an ellipse to the boundary contour and estimate the lens rotation parameters.
Figure 5 shows the execution sequence of the processing blocks each time a frame arrives from the video stream, where we can observe the concurrent execution on the CPU and GPU. Note that the execution is perfectly balanced between CPU and GPU when the CPU boundary-detection time approaches the sum of the RD correction, result image display, image acquisition, and colorspace conversion times.
3.1 Image acquisition
The video feed is captured directly from the endoscope’s control unit video output using the YUV422 transmission format. Exploiting human perception of the chrominance components, the YUV422 format encodes 2 RGB pixels into a single YUV quadruplet. This is of great importance when implementing real-time systems, since this video format significantly reduces the bandwidth necessary for transmission and, consequently, the latency of the video stream, without compromising image quality. Moreover, as shown in Fig. 6, the memory alignment of the YUV422 image is perfectly suited to the GPU’s optimized memory access patterns presented in this article.
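The decoding step can be sketched as follows, assuming YUYV byte order and full-range BT.601 conversion coefficients (the article specifies neither; the names below are ours):

```cpp
#include <algorithm>
#include <cstdint>

// Sketch: decode one YUV422 quadruplet (YUYV byte order assumed) into two
// RGB pixels. Full-range BT.601 coefficients are an assumption; the
// article does not state which conversion matrix the system uses.

struct RGB { uint8_t r, g, b; };

static uint8_t clamp8(double v) {
    return static_cast<uint8_t>(std::min(255.0, std::max(0.0, v)));
}

static RGB yuvToRgb(uint8_t y, uint8_t u, uint8_t v) {
    double cb = u - 128.0, cr = v - 128.0;
    return { clamp8(y + 1.402 * cr),
             clamp8(y - 0.344136 * cb - 0.714136 * cr),
             clamp8(y + 1.772 * cb) };
}

// Two horizontally adjacent pixels share one (U, V) chroma pair, which is
// what halves the bandwidth relative to raw RGB transmission.
void decodeYUYV(const uint8_t quad[4], RGB out[2]) {
    out[0] = yuvToRgb(quad[0], quad[1], quad[3]);  // Y0 with shared U, V
    out[1] = yuvToRgb(quad[2], quad[1], quad[3]);  // Y1 with shared U, V
}
```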
3.2 Heterogeneous processing
As opposed to previous works [4–6, 8–12], where only the problem of RD correction with a static projection model is solved, we address RD correction under projection model changes due to possible endoscopic probe rotation. The update of the projection model requires additional computation for determining the boundary contour of the meaningful region of the image and the relative lens rotation [13]. To achieve higher processing performance, we execute GPU and CPU code concurrently. As shown in Figs. 4 and 5, we split the processing into two POSIX threads: (1) Pthread 1 is responsible for acquiring the image and performing the CUDA API calls for converting the image colorspace, extracting the boundary contour points, and correcting the RD; (2) Pthread 2 is responsible for performing the serial part of the boundary contour estimation. This includes the RANSAC, low-pass filtering, and EKF operations that are detailed in [13]. The high processing frame rate of the system (more than 250 fps, as shown in Sect. 4) allows the RD correction of the current frame (at time t) to use the boundary parameters of the previous frame (at time t − 1) without compromising the accuracy of the correction. By using the CPU and GPU concurrently, we are able to hide the serialized CPU processing workload, as shown in the results of Sect. 4, and therefore substantially increase the system’s performance.
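The two-thread scheme above can be sketched as follows. This is a simplified mock-up using std::thread rather than the Pthreads API the system actually uses, and the work functions are stubs standing in for the GPU kernels and the RANSAC/EKF boundary fit:

```cpp
#include <thread>
#include <vector>

// Per frame: one thread warps frame t using the boundary estimated at
// t-1, while a second thread concurrently fits the boundary from the
// contour points extracted at t. The one-frame lag is acceptable at
// high frame rates, as argued in the text.

struct Boundary { double cx = 0, cy = 0, a = 0, b = 0, angle = 0; };

// Stand-in for the RANSAC + low-pass + EKF ellipse fitting on the CPU.
Boundary estimateBoundaryCPU(const std::vector<float>& contourPts) {
    Boundary bnd;
    bnd.cx = contourPts.empty() ? 0.0 : contourPts[0];
    return bnd;
}

// Stand-in for the colorspace-conversion and RD-correction kernels.
void correctFrameGPU(const Boundary& /*prevBoundary*/) {}

Boundary processStream(int numFrames) {
    Boundary prev;                          // boundary from frame t-1
    for (int t = 0; t < numFrames; ++t) {
        std::vector<float> contour{1.0f};   // stand-in contour points
        Boundary next;
        // CPU thread fits the boundary for the next frame ...
        std::thread cpu([&] { next = estimateBoundaryCPU(contour); });
        // ... while this thread issues the GPU work with the old boundary.
        correctFrameGPU(prev);
        cpu.join();                         // sync before the next frame
        prev = next;
    }
    return prev;
}
```

The join at the end of each iteration is the synchronization point of Fig. 5: the CPU fit is free as long as it finishes before the GPU stages do.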
3.3 Efficient GPU memory accesses
The GPU section of the system presented in Fig. 4 carries most of the workload for correcting an HD frame. Since the RD correction problem is mainly memory bound, we devised efficient accesses to/from the device’s slow global memory to hide data access latency. The optimization of the device’s global memory accesses is based on a specific memory alignment procedure, known as coalescing, that reduces the global number of memory transactions. In this way, threads that are processed simultaneously in batches of 16 (known as half warps) by one multiprocessor can perform their corresponding memory accesses during the same clock cycle.
In the colorspace conversion kernel of Fig. 4, each thread of a half warp accesses the global memory data as a 16-byte aligned array corresponding to four YUV422 quadruplets (Fig. 6). Each quadruplet is decomposed into two RGB pixels and the data are packed into 8-byte words for writing in global memory (each channel and the grayscale value are stored into different memory locations). In this way, the 16 threads of a half warp read a total amount of 16 × 16 = 256 byte data and write 16 × 8 = 128 byte for each image channel plus the grayscale image into the global memory. Since data are perfectly aligned, the global memory read/writes are totally coalesced into single memory load/store accesses.
In the RD correction kernel, each thread of a half warp fetches four texture values from the texture memory of the GPU and interpolates the result using the built-in bilinear interpolation hardware. The retrieved values are interlaced into 4 RGBA quadruplets and therefore the write operations requested from the 16 threads of a half warp are coalesced into a single 256-byte memory transaction (Fig. 7).
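The built-in bilinear filtering performed by the texture units is equivalent, per fetch, to the following CPU sketch of single-channel bilinear interpolation (our own names; clamp-to-edge addressing assumed):

```cpp
#include <vector>

// Bilinear interpolation of a single-channel image at a non-integer
// position (x, y) — a CPU equivalent of what the GPU texture hardware
// computes per fetch in the RD correction kernel. Border handling is
// clamp-to-edge, an assumption on our part.

float bilinear(const std::vector<float>& img, int w, int h,
               float x, float y) {
    auto clampi = [](int v, int lo, int hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    };
    int x0 = clampi(static_cast<int>(x), 0, w - 1);
    int y0 = clampi(static_cast<int>(y), 0, h - 1);
    int x1 = clampi(x0 + 1, 0, w - 1);
    int y1 = clampi(y0 + 1, 0, h - 1);
    float fx = x - x0, fy = y - y0;        // fractional offsets
    float top = (1 - fx) * img[y0 * w + x0] + fx * img[y0 * w + x1];
    float bot = (1 - fx) * img[y1 * w + x0] + fx * img[y1 * w + x1];
    return (1 - fy) * top + fy * bot;
}
```

Offloading this per-pixel computation to the dedicated filtering hardware is what keeps the RD correction kernel memory bound rather than compute bound.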
4 Experimental results
4.1 Experimental setup
Since the heaviest workload runs on the GPU, we conducted a series of experiments using different GPUs and different HD resolution inputs. We performed experimental tests on four Nvidia GPUs belonging to three distinct architectures: (1) a GTX580 (Fermi architecture) with 16 multiprocessors and a total of 512 CUDA cores running at a clock speed of 1,544 MHz; (2) a high-end C2050 (Fermi architecture) with 14 multiprocessors and a total of 448 CUDA cores running at a clock speed of 1,150 MHz; (3) a 9800GT (G80 architecture) with 14 multiprocessors and 112 CUDA cores running at 1,500 MHz; and (4) a GTX260M (GT200 architecture) with 14 multiprocessors and 112 CUDA cores at 1,375 MHz. For each GPU, we tested both an implementation with uncoalesced accesses to global memory and the optimized coalesced version on a sequence of 450 frames. The code is written in C++ using CUDA 4.0.
4.2 Time profiling
Figure 8 compares the processing time of four different implementations of our distortion correction algorithm: (1) a naive, purely CPU-based solution; (2) a hypothetical CPU version using OpenMP directives [21] (see Note 1); (3) our heterogeneous approach using a GTX580 GPU without efficient memory accesses; and (4) our heterogeneous approach using a GTX580 GPU and efficient memory access patterns. The CPU used in the experiment is an Intel Core 2 Quad running at 2.40 GHz. The comparison given in Fig. 8 shows that the CPU is not able to handle the distortion correction in HD images, even when parallelizing the code across the multiple CPU cores.
Figure 9 and Table 1 show the mean time needed to process each frame of the input video stream at different resolutions using the four GPUs mentioned above. The times were computed by correcting a sequence of 450 endoscopic video frames and computing the mean time per frame for each resolution used. It can be seen that the system can handle full HD resolution at frame rates above 60 Hz when using the efficient global memory access patterns. The best processing time for a 1,920 × 1,080 video resolution is achieved with the coalesced implementation in the GTX580 GPU. With this setup, the system is capable of correcting the RD of endoscopic images at a frame rate of approximately 250 fps (500 Mpixels/s throughput).
Figure 11 shows the temporal profile of each part of the system individually. It can be seen that, as expected, the use of efficient memory access patterns significantly decreases the processing time of the colorspace conversion and RD correction kernels. Note that the boundary detection on the CPU (textured bar) is overlapped because it runs concurrently with the GPU code (see Figs. 4 and 5).
4.3 Scalability
Concerning the computation on the GPU presented in Fig. 11 and Table 2, we expect a lower performance gain when applying our efficient memory access patterns on the C2050 and GTX580 GPUs, since Fermi architectures perform intrinsic memory access optimizations when accessing misaligned data in the global memory space. By coalescing data accesses to global memory, we obtain reductions of 6.6 % and 3.7 % in kernel execution time for the C2050 and GTX580 GPUs, respectively, and of approximately 25 % for the older GPUs. This represents reductions of 7 % and 63 % in total computation time for the Fermi and G80/GT200 architectures, respectively.
The graph in Fig. 10 shows the performance of our solution as a function of the number of processing cores in the GPU. It can be seen that, by using a GTX580 GPU with 512 CUDA cores, we achieved a processing time 19.5 % lower than that of the system equipped with a C2050 GPU (448 CUDA cores). Figure 10 shows that the proposed solution is scalable and should suit the future requirements of this type of medical imaging system, which will likely involve higher HD image resolutions and frame rates.
4.4 Discussion
Table 2 shows the difference in GPU occupancy while using the efficient memory access patterns proposed. We can observe that, since the coalesced accesses to the memory significantly reduce the transfer times, the overall time is decreased and the GPU occupancy is more balanced across data transfers and kernel executions.
As shown in Fig. 5, as long as the boundary detection on the CPU (running on Pthread 2) does not exceed the sum of the image acquisition, colorspace conversion, RD correction, and image display processing times (running on Pthread 1), the CPU computation of the boundary is entirely hidden by the GPU processing. For example, observing Fig. 11d, at 576p resolution on a GTX580 GPU, the CPU time (grey bar) is higher than the execution time of the concurrent GPU stages (image acquisition + colorspace conversion + RD correction + image display). In this case, the CPU is the bottleneck of the proposed system. In most of the remaining setups, on the other hand, the GPU execution time exceeds the CPU execution time. The implementation of a heterogeneous system significantly increased the overall performance, truly balancing the workload distribution between CPU and GPU.
5 Conclusion and future work
In this article, we proposed a software-based system for correcting the RD in endoscopic images that is capable of correcting 1080p HD images at 250 fps. The proposed solution is based on a heterogeneous parallel computing architecture that uses the CPU and the GPU concurrently to process the HD video feed, and it not only corrects the RD but also adapts the projection model according to the endoscopic lens rotation. Moreover, we perform memory access optimizations on the GPU that prove fundamental for achieving high processing frame rates and real-time execution on both new and older GPU architectures. With this work, we showed that careful and efficient usage of conventional hardware outperforms current software-based solutions and competes with dedicated hardware-based and heterogeneous cell implementations of RD correction for wide-angle lenses. Our solution is scalable and will support GPUs with even more processing cores, reducing the video processing times and potentially supporting upcoming video systems, such as 4kUHD (3,840 × 2,160) or 8kUHD (7,680 × 4,320).
The presented HD image-processing pipeline can be extended for purposes other than RD correction, such as stereo reconstruction or visual SLAM for computer-assisted surgery. As future work, we will port the code to many-core systems (multiple CPUs/GPUs, for example) to increase the computational capabilities of the system and support more complex image processing in the pipeline.
Notes
OpenMP can be used for shared-memory architectures, such as conventional commercially available off-the-shelf multi-core CPUs of the x86 family. As the workstation CPU has four processor cores, OpenMP can be used to parallelize the serial code across all the cores [21]. Doing so would allow a theoretical maximum speedup of 4× relative to the current CPU code (although the speedup achieved with this kind of parallelization usually does not reach the theoretical maximum).
References
Soper, N.J.: Mastery of Endoscopic and Laparoscopic Surgery. Lippincott Williams & Wilkins, Philadelphia (2008)
Pierre, S.A., Ferrandino, M.N., Simmons, W.N., Fernandez, C., Zhong, P., Albala, D.M., Preminger, G.M.: High definition laparoscopy: objective assessment of performance characteristics and comparison with standard laparoscopy. J. Endourol. 23(3), 523–528 (2009)
Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004). ISBN 0521540518
Asari, K., Kumar, S., Radhakrishnan, D.: A new approach for nonlinear distortion correction in endoscopic images based on least squares estimation. IEEE Trans. Med. Imaging 18, 345–354 (1999)
Vogt, F., Krüger, S., Niemann, H., Schick, C.H.: A system for real-time endoscopic image enhancement. MICCAI, LNCS 2879, pp. 356–363 (2003)
Hartley, R., Kang, S.B.: Parameter-free radial distortion correction with centre of distortion estimation. In: ICCV, vol. 2, pp. 1834–1841 (2005)
Jeught, S.V., Buytaert, J., Dirckx, J.: Real-time geometric lens distortion correction using a graphics processing unit. Opt. Eng. 51, 1–5 (2012)
Asari, K.V.: Design of an efficient VLSI architecture for non-linear spatial warping of wide-angle camera images. J. Syst. Archit. 50(12), 743–755 (2004)
Ngo, H.T., Asari, V.K.: A pipelined architecture for real-time correction of barrel distortion in wide-angle camera images. IEEE Trans. Circuits Syst. Video Technol. 15(3), 436–444 (2005)
Chen, P.-Y., Huang, C.-C., Shiau, Y.-H., Chen, Y.-T.: A VLSI implementation of barrel distortion correction for wide-angle camera images. Trans. Circuits Syst. 56, 51–55 (2009)
Chen, S.-L., Huang, H.-Y., Luo, C.-H.: Time multiplexed VLSI architecture for real-time barrel distortion correction in video-endoscopic images. IEEE Trans. Circuits Syst. Video Technol. 21(11), 1612–1621 (2011)
Daloukas, K., Antonopoulos, C.D., Bellas, N., Chai, S.M.: Fisheye lens distortion correction on multicore and hardware accelerator platforms. In: IEEE International Symposium on Parallel Distributed Processing (IPDPS), pp. 1 –10 (2010)
Melo, R., Barreto, J.P., Falcao, G.: A new solution for camera calibration and real-time image distortion correction in medical endoscopy: initial technical evaluation. IEEE Trans. Biomed. Eng. 59(3), 634–644 (2012)
Barreto, J., Roquette, J., Sturm, P., Fonseca, F.: Automatic camera calibration applied to medical endoscopy. In: BMVC, London (2009)
Fitzgibbon, A.W.: Simultaneous linear estimation of multiple view geometry and lens distortion. In: CVPR, vol. 1, pp.125–132 (2001)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 381–395 (1981)
Bozic, S.M.: Digital and Kalman Filtering. Edward Arnold, London (1979)
NVIDIA.: CUDA Programming Guide 4.0, June 2011. Rev 1
Kirk, D., Hwu, W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, San Francisco (2010)
Butenhof, D.R.: Programming with POSIX Threads. Addison-Wesley Longman Publishing Co. Inc., Boston (1997)
Chapman, B., Jost, G., Van Der Pass, R.: Using OpenMP: Portable Shared Memory Parallel Programming (Scientific Computation and Engineering Series). The MIT Press, Cambridge (2008)
Acknowledgments
This work was supported by the Portuguese Agency for Innovation (ADI) through the QREN co-promotion project 2009/003512 sponsored by the Operational Program “Factores de Competitividade” and the European Fund for Regional Development.
Melo, R., Falcao, G. & Barreto, J.P. Real-time HD image distortion correction in heterogeneous parallel computing systems using efficient memory access patterns. J Real-Time Image Proc 11, 83–91 (2016). https://doi.org/10.1007/s11554-012-0304-3