1 Introduction

Endoscopic images are usually acquired using a flexible or rigid lens coupled to a CCD sensor. The lens is introduced into the patient’s body through a small port for visualizing the anatomical cavities during surgery and diagnosis. In this type of procedure, the endoscopic video is the only guidance for the medical practitioner and therefore the system has to provide the best imaging quality possible [1]. The latest endoscopic systems use 1080p video (1,920 × 1,080 pixels/frame) at 60 Hz as the standard for visualization. The HD resolution and the high frame rate allow an enhanced visualization of the structures and therefore are likely to significantly improve the surgeon’s perception [2].

Due to the small size of the lens, endoscopic images present strong RD (also known as barrel distortion). RD is a nonlinear geometric deformation of the image that moves the points radially toward the center and can severely affect the notion of depth [3]. Several authors have addressed the problem of RD correction in wide-angle lenses [4–7], but these approaches either are not suited for real-time HD processing [4, 5] or do not fully model the endoscopic camera [6, 7].

Advances in very large-scale integration (VLSI) systems using the distortion model estimation proposed in [4] showed promising results in the correction of RD using dedicated hardware. Asari presented in [8] an efficient VLSI architecture to correct the RD in wide-angle camera images by mapping the algorithmic steps onto a linear array. Later, in [9], a pipelined architecture was presented that was able to process images at a rate of 30 Mpixels/s. In [10], the authors proposed a VLSI implementation for RD correction that reduced the number of cells by 61 % compared to [9] and achieved a throughput of 40 Mpixels/s. The recent work presented in [11] reduced hardware cost by at least 69 % and memory requirements by 75 % compared to previous works. In [12], the authors presented a comparison of RD correction implementations on a homogeneous multi-core processor, a heterogeneous cell broadband engine, and an FPGA. They concluded that only an FPGA and a fully optimized version of the code running on the cell processor could provide real-time processing speed (30 fps for input images of 2,592 × 1,944, which translates into a throughput of 150 Mpixels/s).

While previous software-based implementations fail to process large amounts of data in real time or do not fully model the endoscopic camera, hardware-based solutions lack the versatility to adapt to different devices or lenses (and therefore changes in the projection model in real time) and involve additional costs and effort to implement.

In this work, we propose a system for acquiring and processing the HD video feed from an endoscope in real time using a conventional PC equipped with an acquisition board and a GPU. Our solution is based on the work of [13], which updates the endoscopic camera projection model [6] according to the possible lens rotation at each frame time instant. Our system acts as a plug-and-play module that captures the video feed, processes each frame on a regular PC, and then outputs the result back into the existing visualization system (see Fig. 1). We verify that a homogeneous multi-core CPU is not capable of supporting HD real-time video distortion correction, as observed in [12], and that the GPU-based implementation of [13] also fails to deliver the necessary frame rates for the latest endoscopic devices. Our framework for correcting the radial distortion of an HD video stream is based on a heterogeneous implementation that uses both the CPU and the GPU concurrently. We demonstrate that a hybrid solution, where the computational workload is distributed across the CPU and the GPU in parallel, enables the processing of the video feed (1,920 × 1,080 pixels/frame) at frame rates up to 250 fps (500 Mpixels/s throughput) when implementing efficient memory access patterns on the GPU side of the heterogeneous parallel system.

Fig. 1

Proposed system scheme. The video feed is captured directly from the video output of the acquisition device, processed in our heterogeneous system, and then sent back into the existing visualization system

2 Radial distortion correction in clinical endoscopy

This article is closely related to the work presented in [13]. While [13] describes the camera projection model, the calibration of the endoscope in the OR, and the estimation of the relative rotation of the lens scope, the current article addresses in detail the problem of efficient implementation for real-time execution in HD endoscopic devices.

In [13], the endoscopic camera is calibrated from a single image of a chessboard pattern [14], with the radial distortion being described by the so-called division model [15], which uses a single parameter ξ to quantify the amount of image deformation. Let \(\mathbf{X}\) be the 3D coordinates of a point in the world reference frame. The corresponding point \(\mathbf{x}^{\prime}\) in the image plane is given by the projection equation (Eq. 1):

$$ {\mathbf{x}}^{\prime} \sim {\mathsf{K}} \boldsymbol{{\Upgamma}}_{\xi}({\mathsf{P}} {\mathbf{X}}), $$
(1)

where ∼ denotes an equality up to a scale factor, \(\mathsf{K}\) is the well-known intrinsics matrix obtained by the camera calibration and \(\mathsf{P}\) denotes the standard 3 × 4 projection matrix [3]. \(\boldsymbol{\Upgamma}_{\xi}\) is the nonlinear radial distortion function that maps an undistorted world point \(\mathbf{x}_{\rm u} \sim (x_{\rm u}\ y_{\rm u}\ z_{\rm u})^{\rm T}\) into the corresponding distorted world point:

$$ {\boldsymbol{\Upgamma}}_{\xi}({\mathbf{x}}_{\rm u}) \sim {{\left( 2x_{\rm u}\ 2y_{\rm u}\ z_{\rm u} + \sqrt{z_{\rm u}^2 - 4 \xi (x_{\rm u}^2 +y_{\rm u}^2)}\right)}}^{\mathsf{T}} . $$
(2)

Assuming that the 3D point \(\mathbf{X}\) is represented in the camera reference frame, \(\mathsf{P} \sim (\mathsf{I}_{3 \times 3} \ 0_{3 \times 1})\) and therefore we can compute the distorted image coordinates of an undistorted image point \(\mathbf{x}_{\rm u}^{\prime}\) (in pixels) by:

$$ F({\mathbf{x}}^{\prime}_{\rm u}) \sim {\mathsf{K}}\, \boldsymbol{\Upgamma}_{\xi}({\mathsf{K}}_y^{-1} {\mathbf{x}}_{\rm u}^{\prime}), $$
(3)

where \({\mathsf{K}}_y^{-1}\) maps the undistorted image point \(\mathbf{x}_{\rm u}^{\prime}\) into a canonical plane, specifying certain desired characteristics of the undistorted image (e.g., center, resolution) [13].
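
To make the mapping concrete, the following sketch evaluates Eqs. 2 and 3 for a single undistorted pixel. The function and type names are ours, and the matrices \(\mathsf{K}\) and \({\mathsf{K}}_y^{-1}\) are assumed to be stored in row-major order; this is an illustration of the model, not the implementation of [13].

```cpp
// Illustrative evaluation of Eqs. 2-3 (our sketch): maps an undistorted pixel
// (u_u, v_u) to its location (u_d, v_d) in the distorted image.
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 mat3MulVec3(const double M[9], Vec3 v)               // row-major 3x3 * vector
{
    return { M[0]*v.x + M[1]*v.y + M[2]*v.z,
             M[3]*v.x + M[4]*v.y + M[5]*v.z,
             M[6]*v.x + M[7]*v.y + M[8]*v.z };
}

void distortPixel(const double K[9], const double Kinv_y[9], double xi,
                  double u_u, double v_u, double* u_d, double* v_d)
{
    Vec3 xu = mat3MulVec3(Kinv_y, {u_u, v_u, 1.0});       // to the canonical plane
    double r2 = xu.x * xu.x + xu.y * xu.y;
    Vec3 xd = { 2.0 * xu.x, 2.0 * xu.y,
                xu.z + std::sqrt(xu.z * xu.z - 4.0 * xi * r2) };   // Eq. 2
    Vec3 p = mat3MulVec3(K, xd);                           // back to pixel units
    *u_d = p.x / p.z;                                      // homogeneous normalization
    *v_d = p.y / p.z;
}
```

This is the forward mapping that the RD correction kernel of Sect. 3.3 evaluates once per output pixel before sampling the distorted input image.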

The camera calibration changes during operation because the doctor rotates the lens scope with respect to the CCD head. As discussed in [13], the problem can be solved by considering an adaptive projection model that takes into account this relative rotation. The authors devised an efficient algorithm for extracting the image boundary and detecting the lens mark (Fig. 2) that relies on the extraction of boundary contour points on the GPU, as well as standard methods implemented on the CPU (such as RANSAC [16] and Kalman filtering [17]) to deliver a robust estimation of the rotation parameters. With this new adaptive projection model, the intrinsics matrix is updated by a rotation around the lens rotation center \(\mathbf{q}\) and the distortion mapping of Eq. 3 becomes:

$$ F({\mathbf{x}}^{\prime}_{\rm u}) \sim {\mathsf{K}}_i\, \boldsymbol{\Upgamma}_{\xi}({\mathsf{R}}_{-\alpha_i,{\mathbf{q}}_i^{\prime\prime}}\, {\mathsf{K}}_y^{-1}\, {\mathbf{x}}^{\prime}_{\rm u}), $$
(4)

where \(\mathsf{K}_i \sim \mathsf{R}_{\alpha_i,\mathbf{q}_i} \mathsf{K}\) is the intrinsics matrix updated according to the lens rotation α at time i and \(\mathsf{R}_{-\alpha_i,\mathbf{q}_i^{\prime\prime}}\) is a rotation matrix that rotates the warping result back to the original orientation.
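
A rotation by α around an image point \(\mathbf{q}\) can be written as the 3 × 3 homogeneous matrix \(\mathsf{T}(\mathbf{q})\,\mathsf{R}(\alpha)\,\mathsf{T}(-\mathbf{q})\). The sketch below builds this matrix; the function name and the row-major storage are our own illustrative choices, not part of [13].

```cpp
// Illustrative helper (not from [13]): 3x3 homogeneous matrix that rotates
// image points by angle alpha (radians) around the rotation center q = (qx, qy).
#include <cmath>

void rotationAboutPoint(double alpha, double qx, double qy, double R[9])
{
    const double c = std::cos(alpha), s = std::sin(alpha);
    // R = T(q) * Rot(alpha) * T(-q), stored row-major
    R[0] = c;  R[1] = -s;  R[2] = qx - c * qx + s * qy;
    R[3] = s;  R[4] =  c;  R[5] = qy - s * qx - c * qy;
    R[6] = 0;  R[7] =  0;  R[8] = 1;
}
```

Pre-multiplying \(\mathsf{K}\) by this matrix evaluated at (α_i, \(\mathbf{q}_i\)) gives the updated intrinsics \(\mathsf{K}_i\), and a matrix of the same form evaluated at (−α_i, \(\mathbf{q}_i^{\prime\prime}\)) plays the role of \(\mathsf{R}_{-\alpha_i,\mathbf{q}_i^{\prime\prime}}\) in Eq. 4.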

Fig. 2

RD correction of an imaged chessboard pattern. In order to adapt the projection model to the lens rotation, we have to determine the meaningful region boundary \(\Upomega\), the triangular mark that indicates the relative rotation α between the camera head and the lens, and the lens rotation center coordinates \(\mathbf{q}\) in the image

Figures 2 and 3 show the results of the RD correction in different environments using the mapping function in Eq. 3, where the effects of the distortion are easily noticed.

Fig. 3

Result of the RD correction in different environments. The left column shows the original image and the right column the image after distortion correction

3 Proposed system

GPUs have emerged as powerful processors in the last few years. The recent introduction of the CUDA interface [18] has enabled the scientific community to parallelize computationally intensive algorithms and to achieve faster execution times [19]. The Nvidia GF100 and subsequent architectures (also known as Fermi) introduced significant improvements in memory accesses and a drastic increase in compute capability when compared with the previous G80 and GT200 families. In Sect. 4, we perform experiments using both older architectures (G80 and GT200) and the more recent Fermi family.

The proposed system consists of a regular workstation equipped with a GPU and an HD acquisition board. The HD video feed is captured through the acquisition board and the image is transferred to the GPU, which performs part of the processing. At the end, the resulting corrected image is displayed on the visualization system through an OpenGL buffer. Figure 4 illustrates the processing steps for each frame of the video feed.

Fig. 4

Processing stages of the RD correction system. The system runs in two POSIX threads. Pthread 1 is responsible for the acquisition and processing of the acquired frame on the GPU. Pthread 2 is responsible for the serial parts of the algorithm running on the CPU. The threads are synchronized through conditional variables placed at the red horizontal dashed lines. Pthread 2 is launched at \(t_1\) and delivers the previous boundary estimation \(\Upomega_{t-1}\) as well as the rotation parameters (α, q) to Pthread 1, which waits at \(t_2\). In this way, the system processes the current frame based on the previous boundary estimation

The system runs in a heterogeneous environment with one POSIX thread [20] handling the GPU device calls and the other performing the serialized CPU processing necessary to extract the meaningful region boundary of the image mentioned in Sect. 2.

The system is divided into four main processing blocks:

  • Colorspace conversion: After transferring the image into the GPU (in YUV422 format), a colorspace and grayscale conversion is performed. Each RGB channel and the grayscale value are written to the global memory and later bound to the texture memory space of the GPU.

  • Boundary detection: Using the grayscale image and the previous boundary estimation parameters from the CPU thread, the boundary contour points are extracted using the procedure described in [13] and the result is passed to the CPU to compute the boundary for the next iteration.

  • RD correction: Using the R, G and B channel textures and the previous boundary estimation parameters, the RD is corrected on the GPU and the result is written to the OpenGL global memory buffer for visualization.

  • CPU thread: The CPU thread is responsible for robustly estimating the boundary contour from a set of contour points extracted on the GPU. This procedure involves RANSAC [16], low-pass filtering, and an EKF [17] to robustly fit an ellipse to the boundary contour and estimate the lens rotation parameters.

Figure 5 shows the execution sequence of the processing blocks each time a frame arrives from the video stream, where we can observe the concurrent execution on the CPU and GPU. Note that the execution is perfectly balanced between the CPU and the GPU when the CPU boundary detection time approaches the sum of the RD correction, result image display, image acquisition, and colorspace conversion times.

Fig. 5

Image-processing time-line sequence for a generic video stream

3.1 Image acquisition

The video feed is captured directly from the endoscope’s control unit video output using the YUV422 transmission format. Exploiting the lower sensitivity of human perception to the chrominance components, the YUV422 format encodes two RGB pixels into a single YUV quadruple. This is of great importance when implementing real-time systems, since this video format significantly reduces the bandwidth necessary for transmission and, consequently, the latency of the video stream without compromising the image quality. Moreover, as shown in Fig. 6, the memory alignment of the YUV422 image is perfectly suited to the GPU’s optimized memory access patterns presented in this article.
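
As a rough figure, a 1080p stream at 60 Hz then requires about 1,920 × 1,080 × 2 bytes × 60 ≈ 249 MB/s, roughly one-third less than the ≈373 MB/s needed for uncompressed 24-bit RGB.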

Fig. 6

Memory access pattern per thread for the colorspace conversion kernel. Each thread in a half warp accesses a 16-byte word from the global memory (four YUV422 quadruples). For each YUV422 quadruple, the thread computes the two corresponding R, G, B and grayscale values and packs them into 8-byte words that are written to the corresponding global memory location. Since the data are aligned, the 16 threads of a half warp read a total of 16 × 16 = 256 bytes and write 16 × 8 = 128 bytes for each image channel plus the grayscale using single memory load/store instructions

3.2 Heterogeneous processing

As opposed to previous works [4–6, 8–12], where only the problem of RD correction using a static projection model is solved, we address RD correction under projection model changes due to the possible endoscopic probe rotation. The update of the projection model requires additional computation for determining the boundary contour of the meaningful region of the image and the relative lens rotation [13]. To achieve higher processing performance, we execute GPU and CPU code concurrently. As shown in Figs. 4 and 5, we split the processing into two POSIX threads: (1) Pthread 1 is responsible for acquiring the image and performing the CUDA API calls for converting the image colorspace, extracting the boundary contour points, and correcting the RD; (2) Pthread 2 is responsible for performing the serial part of the boundary contour estimation. This includes the RANSAC, low-pass filtering, and EKF operations that are detailed in [13]. The high processing frame rate of the system (more than 250 fps, as shown in Sect. 4) allows the RD correction of the current frame (at time t) based on the boundary parameters of the previous frame (at time t − 1) without compromising the accuracy of the correction. By using both the CPU and GPU concurrently, we are able to hide the serialized CPU processing workload, as shown in the results of Sect. 4, and therefore substantially increase the system’s performance.
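
A minimal sketch of this synchronization scheme is shown below, assuming a plain mutex/condition-variable handshake between the two threads; the names and structure are illustrative of Figs. 4 and 5 rather than the actual code.

```cpp
// Illustrative Pthread handshake (our sketch): Pthread 2 publishes the boundary
// and rotation parameters estimated from frame t-1, which Pthread 1 consumes
// before warping frame t.
#include <pthread.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static bool params_ready = false;   // (Omega, alpha, q) from frame t-1 available?

static void* boundary_thread(void*)          // Pthread 2: serial CPU work
{
    for (;;) {
        // ... RANSAC + low-pass filtering + EKF on the contour points of t-1 ...
        pthread_mutex_lock(&lock);
        params_ready = true;                 // publish the previous-frame estimate
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
        // ... wait for the next set of contour points from the GPU ...
    }
    return nullptr;
}

static void gpu_frame_iteration(void)        // Pthread 1: one frame of the pipeline
{
    // acquire frame t, launch the colorspace and boundary-point kernels ...
    pthread_mutex_lock(&lock);
    while (!params_ready)                    // synchronization point t2 of Fig. 4
        pthread_cond_wait(&ready, &lock);
    params_ready = false;
    pthread_mutex_unlock(&lock);
    // ... launch the RD correction kernel with the t-1 parameters and display ...
}
```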

3.3 Efficient GPU memory accesses

The GPU section of the system presented in Fig. 4 carries most of the workload for correcting an HD frame. Since the RD correction problem is mainly memory bound, we devised efficient access patterns to/from the device’s slow global memory to hide data access latency. The optimization of the device’s global memory accesses is based on a specific memory alignment procedure, known as coalescing, that reduces the total number of memory transactions. In this way, threads that are processed simultaneously in batches of 16 (known as half warps) by one multiprocessor can perform the corresponding memory accesses during the same clock cycle.

In the colorspace conversion kernel of Fig. 4, each thread of a half warp accesses the global memory data as a 16-byte aligned array corresponding to four YUV422 quadruples (Fig. 6). Each quadruple is decomposed into two RGB pixels and the data are packed into 8-byte words for writing to global memory (each channel and the grayscale value are stored in different memory locations). In this way, the 16 threads of a half warp read a total of 16 × 16 = 256 bytes and write 16 × 8 = 128 bytes for each image channel plus the grayscale image into the global memory. Since the data are perfectly aligned, the global memory reads/writes are fully coalesced into single memory load/store accesses.
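
The CUDA kernel below sketches this pattern (our illustration, not the authors’ kernel): each thread loads one 16-byte uint4 holding four YUV422 quadruples, converts them, and stores one 8-byte uint2 per output plane. The YUYV byte ordering and the BT.601 conversion coefficients are assumptions about the capture format.

```cuda
__device__ unsigned char clampToByte(float v)
{
    return (unsigned char)fminf(fmaxf(v, 0.f), 255.f);
}

__global__ void colorspaceKernel(const uint4* __restrict__ yuv,
                                 uint2* r, uint2* g, uint2* b, uint2* gray,
                                 int numWords)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numWords) return;

    uint4 w = yuv[i];                                   // one coalesced 16-byte load
    unsigned int quads[4] = {w.x, w.y, w.z, w.w};
    union { unsigned char c[8]; uint2 v; } R, G, B, Y;

    for (int k = 0; k < 4; ++k) {                       // each quadruple -> 2 pixels
        float y0 = (float)( quads[k]        & 0xFF);
        float u  = (float)((quads[k] >> 8)  & 0xFF) - 128.f;
        float y1 = (float)((quads[k] >> 16) & 0xFF);
        float v  = (float)((quads[k] >> 24) & 0xFF) - 128.f;
        for (int p = 0; p < 2; ++p) {
            float y = p ? y1 : y0;
            R.c[2*k + p] = clampToByte(y + 1.402f * v);              // BT.601
            G.c[2*k + p] = clampToByte(y - 0.344f * u - 0.714f * v);
            B.c[2*k + p] = clampToByte(y + 1.772f * u);
            Y.c[2*k + p] = clampToByte(y);              // luma doubles as grayscale
        }
    }
    r[i] = R.v; g[i] = G.v; b[i] = B.v; gray[i] = Y.v;  // four 8-byte coalesced stores
}
```

The four output planes would subsequently be bound to the GPU’s texture memory space, as described for the colorspace conversion block of Fig. 4.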

In the RD correction kernel, each thread of a half warp fetches four texture values from the texture memory of the GPU and interpolates the result using the built-in bilinear interpolation hardware. The retrieved values are interleaved into 4 RGBA quadruples and therefore the write operations requested by the 16 threads of a half warp are coalesced into a single 256-byte memory transaction (Fig. 7).
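
A hypothetical version of the RD correction kernel is sketched below to illustrate the access pattern. It is written against the texture-object API, assumes the channel textures are bound with normalized-float reads and hardware bilinear filtering, and, for brevity, evaluates the static mapping of Eq. 3 rather than the full rotation-compensated mapping of Eq. 4; the parameter packing and names are ours, not the authors’ code.

```cuda
struct WarpParams { float K[9], Kinv[9], xi; };      // K and K_y^{-1}, row-major

__device__ void warpPixel(const WarpParams& p, float u, float v,
                          float* ud, float* vd)
{
    // Eq. 3: undistorted pixel -> canonical plane -> division model -> pixels
    float x = p.Kinv[0]*u + p.Kinv[1]*v + p.Kinv[2];
    float y = p.Kinv[3]*u + p.Kinv[4]*v + p.Kinv[5];
    float z = p.Kinv[6]*u + p.Kinv[7]*v + p.Kinv[8];
    float xd = 2.f*x, yd = 2.f*y;
    float zd = z + sqrtf(z*z - 4.f*p.xi*(x*x + y*y));
    float w  = p.K[6]*xd + p.K[7]*yd + p.K[8]*zd;
    *ud = (p.K[0]*xd + p.K[1]*yd + p.K[2]*zd) / w;
    *vd = (p.K[3]*xd + p.K[4]*yd + p.K[5]*zd) / w;
}

__global__ void rdCorrectionKernel(cudaTextureObject_t texR,
                                   cudaTextureObject_t texG,
                                   cudaTextureObject_t texB,
                                   uchar4* out, int width, int height,
                                   WarpParams p)
{
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;   // 4 pixels per thread
    int y  =  blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= height || x0 + 3 >= width) return;

    union { uchar4 px[4]; uint4 word; } res;                // force 16-byte alignment
    for (int k = 0; k < 4; ++k) {
        float ud, vd;
        warpPixel(p, (float)(x0 + k), (float)y, &ud, &vd);
        // unnormalized texture fetches; +0.5f selects the pixel center
        res.px[k] = make_uchar4(
            (unsigned char)(255.f * tex2D<float>(texR, ud + 0.5f, vd + 0.5f)),
            (unsigned char)(255.f * tex2D<float>(texG, ud + 0.5f, vd + 0.5f)),
            (unsigned char)(255.f * tex2D<float>(texB, ud + 0.5f, vd + 0.5f)),
            255);
    }
    // the 16 threads of a half warp issue one coalesced 256-byte store
    reinterpret_cast<uint4*>(out)[(y * width + x0) / 4] = res.word;
}
```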

Fig. 7

Memory access pattern per thread for the RD correction kernel. For each group of 4 RGBA pixels, the radial distortion kernel thread computes the corresponding locations in the distorted space. As the resulting coordinates do not necessarily fall into the regular lattice of the input image, the data are retrieved through 2D texture memory fetches that, through the built-in interpolation hardware, perform the bilinear interpolation of the value. Each thread in a half warp fetches four elements of each channel texture and the result is packed into a 16-byte word (consisting of 4 RGBA pixels) and written to the global memory (that is mapped to an OpenGL buffer). The data to be written into the global memory by the 16 threads of a half warp are perfectly aligned and therefore the operation is coalesced into a single memory write instruction

4 Experimental results

4.1 Experimental setup

Since the heaviest workload is carried by the GPU, we conducted a series of experiments using different GPUs and different HD resolution inputs. We performed experimental tests on four Nvidia GPUs belonging to three distinct architectures: (1) a GTX580 (Fermi architecture) with 16 multiprocessors and a total of 512 CUDA cores running at a clock speed of 1,544 MHz; (2) a high-end C2050 (Fermi architecture) with 14 multiprocessors and a total of 448 CUDA cores running at a clock speed of 1,150 MHz; (3) a 9800GT (G80 architecture) with 14 multiprocessors and 112 CUDA cores running at 1,500 MHz; and (4) a GTX260M (GT200 architecture) with 14 multiprocessors and 112 CUDA cores at 1,375 MHz. For each hardware configuration, we tested both the implementation with uncoalesced accesses to the GPU’s global memory and the optimized coalesced version on a sequence of 450 frames. The code is written in C++ using CUDA 4.0.

4.2 Time profiling

Figure 8 compares the processing time of four different implementations of our distortion correction algorithm: (1) a naive, purely CPU-based solution; (2) a hypothetical CPU version using OpenMP directives [21]; (3) our heterogeneous approach using a GTX580 GPU without efficient memory accesses; and (4) our heterogeneous approach using a GTX580 GPU and efficient memory access patterns. The CPU used in the experiment is an Intel\(^{\circledR}\) Core\(^{\rm TM}\)2 Quad CPU running at 2.40 GHz. The comparison given in Fig. 8 shows that the CPU is not able to handle the distortion correction of HD images even when parallelizing the code across the multiple CPU cores.
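
For reference, the OpenMP variant can be pictured as a parallel loop over the output rows; the sketch below is our own simplification (it reuses the distortPixel() routine sketched in Sect. 2 and nearest-neighbor sampling instead of bilinear interpolation), not the code that was benchmarked.

```cpp
// Illustrative OpenMP parallelization of the per-pixel warp (our sketch).
#include <cmath>
#include <omp.h>

// distortPixel() as in the Sect. 2 sketch (Eq. 3).
void distortPixel(const double K[9], const double Kinv_y[9], double xi,
                  double u_u, double v_u, double* u_d, double* v_d);

void correctFrameCPU(const unsigned char* src, unsigned char* dst,
                     int width, int height, int channels,
                     const double K[9], const double Kinv_y[9], double xi)
{
    #pragma omp parallel for schedule(static)
    for (int v = 0; v < height; ++v) {
        for (int u = 0; u < width; ++u) {
            double ud, vd;
            distortPixel(K, Kinv_y, xi, (double)u, (double)v, &ud, &vd);
            int us = (int)std::lround(ud), vs = (int)std::lround(vd);
            bool inside = (us >= 0 && us < width && vs >= 0 && vs < height);
            for (int c = 0; c < channels; ++c)
                dst[(v * width + u) * channels + c] =
                    inside ? src[(vs * width + us) * channels + c] : 0;
        }
    }
}
```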

Fig. 8

Comparing CPU and GPU execution times for correcting the radial distortion of HD images. The times represent the mean time needed to correct each frame of the video stream at different resolutions

Figure 9 and Table 1 show the mean time needed to process each frame of the input video stream at different resolutions using the four GPUs mentioned above. The times were computed by correcting a sequence of 450 endoscopic video frames and computing the mean time per frame for each resolution used. It can be seen that the system can handle full HD resolution at frame rates above 60 Hz when using the efficient global memory access patterns. The best processing time for a 1,920 × 1,080 video resolution is achieved with the coalesced implementation in the GTX580 GPU. With this setup, the system is capable of correcting the RD of endoscopic images at a frame rate of approximately 250 fps (500 Mpixels/s throughput).

Fig. 9

Mean total time per frame of the system for different GPUs at different resolutions. Both implementations using coalesced and uncoalesced memory accesses are compared. The resulting output frame size of the system is equal to the input resolution. All four devices can process 1080p video resolution at 60 fps when using coalesced accesses to global memory

Table 1 Mean total time per frame in milliseconds for the different hardware configurations tested at the resolutions used in Fig. 9

Figure 11 shows the temporal profile of each part of the system individually. It can be seen that, as expected, the use of efficient memory access patterns significantly decreases the processing time of the colorspace conversion and RD correction kernels. Note that the boundary detection on the CPU (textured bar) is overlapped because it runs concurrently with the GPU code (see Figs. 4 and 5).

4.3 Scalability

Concerning the computation on the GPU presented in Fig. 11 and Table 2, we expect a lower gain in performance when applying our efficient memory access patterns on the C2050 and GTX580 GPUs, since Fermi architectures perform intrinsic memory access optimizations when accessing misaligned data from the global memory space. By coalescing data accesses to global memory, we obtain gains of 6.6 and 3.7 % in the kernel execution time for the C2050 and GTX580 GPUs, respectively, and approximately 25 % for the older GPUs. This represents a 7 and 63 % reduction in the total computation time for the Fermi and G80/GT200 architectures, respectively.

Table 2 GPU occupation percentage for host–device transfers (H-D), device–device transfers (D-D), and kernel executions for the colorspace conversion, boundary detection and RD correction (Kernels) in the different GPUs

The graph in Fig. 10 shows the performance of our solution as a function of the number of processing cores in the GPU. It can be seen that, by using a GTX580 GPU with 512 CUDA cores, we achieve a processing time 19.5 % lower than that of the system equipped with a C2050 GPU (448 CUDA cores). Figure 10 shows that the proposed solution is scalable and that it should meet the future requirements of this type of medical imaging system, which are expected to include higher HD image resolutions and frame rates.

Fig. 10

Execution time per frame of our system processing a 1080p video stream as a function of the number of cores available on the GPU

4.4 Discussion

Table 2 shows the difference in GPU occupancy while using the efficient memory access patterns proposed. We can observe that, since the coalesced accesses to the memory significantly reduce the transfer times, the overall time is decreased and the GPU occupancy is more balanced across data transfers and kernel executions.

As shown in Fig. 5, as long as the boundary detection on the CPU (running on Pthread 2) does not exceed the sum of the image acquisition, colorspace conversion, RD correction, and image display processing times (running on Pthread 1), the CPU computation of the boundary is entirely hidden by the GPU processing. For example, observing Fig. 11d, at 576p resolution for a GTX580 GPU, we can see that the CPU time (grey bar) is higher than the execution time of the concurrent GPU stages (image acquisition + colorspace conversion + RD correction + image display). In this case, the CPU is the bottleneck of the proposed system. In most of the remaining setups, on the other hand, the GPU execution time is longer than the CPU execution time. The implementation of a heterogeneous system significantly increased the overall performance, effectively balancing the workload distribution between the CPU and the GPU.

Fig. 11

Time profile of the processing stages for the system using coalesced (c.) and uncoalesced (u.) memory accesses to the GPU’s global memory. The colorspace conversion time also includes the transfer of the input image from the host to the device (GPU)

5 Conclusion and future work

In this article, we proposed a software-based system for correcting the RD in endoscopic images that is capable of correcting 1080p HD images at 250 fps. The proposed solution is based on a heterogeneous parallel computing architecture that uses both the CPU and the GPU concurrently to process the HD video feed, and not only corrects the RD but also adapts the projection model according to the endoscopic lens rotation. Moreover, we perform memory access optimizations on the GPU that turn out to be fundamental for achieving higher processing frame rates and real-time execution on both new and older GPU architectures. With this work, we showed that careful and efficient use of conventional hardware outperforms current software-based solutions and competes with dedicated hardware-based and heterogeneous cell implementations of RD correction for wide-angle lenses. Our solution is scalable and will support GPUs with even more processing cores, reducing the video processing times and potentially supporting upcoming video systems, such as 4K UHD (3,840 × 2,160) or 8K UHD (7,680 × 4,320).

The presented HD image-processing pipeline can be extended for purposes other than RD correction, such as stereo reconstruction or visual SLAM for computer-assisted surgery. As future work, we will port the code to many-core systems (multiple CPUs/GPUs, for example) to increase the computational capabilities of the system and support more complex image processing in the pipeline.