1 Introduction

Maximum intensity projection (MIP) is a volume visualization technique that displays the maximum intensity of 3D volume data for a given viewpoint and viewing direction. The clinical usability of MIP has been extensively evaluated, and MIP has proved useful for generating angiographic images from computed tomography (CT) and magnetic resonance (MR) imaging data [1]. In MIP rendering, a virtual projection ray is cast along the viewing direction through the volume data, and the maximum of the voxel intensities along this ray is determined. This maximum value is stored in the corresponding pixel and displayed on the screen. Figure 1 shows an example in which the value 230 is stored in the corresponding pixel, since it is the maximum intensity along the projection ray, which is parallel to the viewing direction.

Fig. 1
figure 1

Process of MIP rendering. a Generation of a projection ray, and b determination of MIP value
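The ray-maximum computation just described can be sketched in plain C++. This is a minimal scalar version under our own simplifying assumptions (orthographic view along the Z-axis, X-fastest voxel storage); the function name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#include <algorithm>

// Minimal scalar MIP for a volume of size nx*ny*nz, viewed along the Z-axis.
// The volume is stored X-fastest (index x + nx*(y + ny*z)); the output
// image has nx*ny pixels, each holding the maximum value along its ray.
std::vector<uint16_t> mipAlongZ(const std::vector<uint16_t>& volume,
                                int nx, int ny, int nz) {
    std::vector<uint16_t> image(static_cast<size_t>(nx) * ny, 0);
    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                uint16_t v = volume[x + nx * (y + static_cast<size_t>(ny) * z)];
                uint16_t& p = image[x + static_cast<size_t>(nx) * y];
                p = std::max(p, v);  // keep the maximum along the ray
            }
    return image;
}
```

Note how the per-voxel work is a single comparison; the rest of the paper is about making that comparison branch-free and parallel.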

3D volume data can be acquired from slice images of a human body taken by CT or MR. Using the MIP rendering technique, clinically meaningful images can be acquired of human tissue, contrast-enhanced vessels, and skeletal structures. However, since a MIP image contains no depth information, a doctor may frequently change the viewing direction to acquire a depth cue for diagnosis. Therefore, numerous methods have been studied for the fast generation of MIP images. First, a region of the volume data can be removed during the preprocessing stage when it has a low probability of contributing to the MIP image. Schreiner et al. proposed a fast algorithm that projects the high-density voxels first; all of the data is sorted by density in a preprocessing step that takes a few minutes [24]. Mroz et al. proposed a 1%-lossy compression method, which removes about half of the unnecessary volume data in the preprocessing stage [19]. Mroz et al. also proposed a cell projection method, where a cell is a data block that, after the removal of unnecessary data, is sorted by density along the viewing direction [20].

Object-order MIP rendering methods have been studied to improve the sampling process and cache efficiency. To minimize the computational cost of trilinear sampling, shear-warp [5] and bilinear sampling [21] were applied. Mora et al. proposed an object-order MIP ray casting method with a theoretical analysis [18]. Kiefer et al. proposed an object-order MIP ray casting method that improves cache efficiency by propagating rays along an inclined direction [6].

This research can be summarized as a pipelined flow. First, the volume data is divided into blocks, and unnecessary blocks are removed in the preprocessing step [19]. Since the result of MIP rendering is independent of the rendering order, the data is sorted, and rendering is performed in this sorted order [20, 21, 24]. Processing is performed per block using block-based ray casting [6, 18], which improves cache efficiency. In spite of these efforts, most previous methods required a long preprocessing step, and the rendering speed was only around 2 frames per second (fps).

With recent advances in graphics processing unit (GPU) performance, much research on volume data visualization has been conducted. In the past, volume visualization using 3D textures on a GPU was only possible on a workstation [4]; it later became possible on general-purpose GPUs [22]. However, there are a few drawbacks to using a GPU for volume data visualization. First, the installation and development methodology of specialized libraries, such as CUDA (compute unified device architecture) [17] and DirectX [25], are required to develop and deploy a visualization algorithm. A proper hardware unit with a GPU must be installed to use the developed application in a clinical environment. Finally, since the memory of a GPU is generally smaller than the main memory available to a CPU, a large dataset such as 4D-CT cannot be loaded onto the GPU, and direct volume rendering becomes impossible [27].

To overcome the problems of previous research, Kye proposed a parallel algorithm using the single instruction multiple data (SIMD) functions of the CPU [7]. In this method, the parallelism of the CPU is exploited with no preprocessing step. Intensity comparison operations, which are frequently used in MIP to compute the maximum intensity, are compiled into branch instructions on a CPU, resulting in inefficient computation. In contrast, an intensity comparison in SIMD is processed by fast arithmetic operations, yielding a large improvement in computation speed. In addition, many comparisons are processed concurrently, providing a further performance gain. As a result, interactive MIP rendering became possible on a CPU without the help of a GPU. However, the rendering speed varied greatly with the viewing direction.

In this paper, we propose a fast MIP algorithm using advanced vector extensions (AVX) that improves the method proposed by Kye [7]. In this method, a coordinate transformation based on shear-warp decomposition is performed, so that both the input medical volume data and the output MIP image are accessed sequentially. This improves the efficiency of memory access. The intensity comparison operations that dominate MIP are replaced by fast parallelized arithmetic operations of AVX, improving computational efficiency. Compared with the previous method [7], the visualization speed is much improved, since one operation handles four times as much data. We also propose an efficient matrix transposition method. The previous method [7] improved the efficiency of memory access, but a matrix transposition was required for some viewing directions, such as the sagittal view. Since a matrix transposition is generally inefficient, the MIP rendering time for a sagittal view was about four times longer than for the other viewing directions. In our method, we developed a matrix transposition that handles a 16 × 16 matrix in 64 operations, and we prove that transposing an n × n matrix using SIMD requires at least n log2 n operations. Experimental results showed that the speed variation across viewing directions was reduced by more than a factor of two compared with the previous method.

Consequently, our method performs real-time MIP visualization of 512 × 512 × 552 medical image data at 20 fps using a general single-core CPU without any preprocessing step. This speed has not been achieved by any previous CPU-based MIP renderer. Our method can be used in clinical practice without the help of a GPU; it does not require any specialized hardware, and can run on any kind of general-purpose hardware, such as a desktop personal computer (PC) or a notebook PC.

The remainder of this paper is organized as follows. The next section describes SIMD operations and MIP implementations using SIMD operations. Section 3 presents the proposed method. Section 4 presents the results of the application of the proposed method to clinical datasets, along with a comparison with other previous methods. Finally, Section 5 summarizes the results of our findings, and discusses directions for future work.

2 Related works

In this section, Section 2.1 explains the SIMD instructions that are the basis of the proposed research, while Section 2.2 describes the MIP algorithm that utilizes them.

2.1 SIMD instruction

SIMD is a technology, or hardware structure, that executes a single instruction in parallel on multiple data elements. Intel developed MultiMedia eXtension (MMX), the SIMD extension instruction set for the Pentium, in 1997. With MMX technology, 64-bit data can be processed in parallel by dividing it into the element size desired by the user. For example, four 16-bit integer (short) or eight 8-bit integer (char) values can be computed concurrently (Figs. 2, 3 and 4).

Fig. 2
figure 2

SIMD instruction to input multiple data to one 64-bit variable. a Eight 8-bit data input, and b four 16-bit data input

Fig. 3
figure 3

Multi data operation using SIMD

Fig. 4
figure 4

Max operation using SIMD

In addition, the max instruction in SIMD makes it possible to compare two sets of values and extract the larger value of each pair at once. The inefficient conditional branch instruction is therefore replaced by a simple arithmetic instruction. Applying SIMD instructions to MIP in this way makes it possible to obtain the maximum values efficiently.

However, there are preconditions for executing SIMD instructions efficiently. First, the data must be contiguous in memory. In the above example, the data values 6, 8, 2, and 5 must be sequentially present in memory; if the data is scattered, a large inefficiency occurs when transferring it to a SIMD register. Second, parallel computation should be performed element-wise. In the above example, calculating the larger value of 6 and 3 (or 8 and 7), which is called a vertical operation, is efficient. However, obtaining the largest value, 8, among the values 8, 6, 2, and 5 belonging to one register, which is called a horizontal operation, is very inefficient. In Section 2.2, we show a fast MIP implementation that satisfies these two preconditions.
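The vertical max operation described above can be sketched with SSE2 intrinsics (we use SSE2 here rather than the MMX of the original example, since SSE2 is baseline on all x86-64 CPUs; the function name is ours):

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2, baseline on all x86-64 CPUs
#include <cstdint>

// Vertical max: eight 16-bit comparisons in one instruction.
// a and b each point to 8 contiguous shorts; the element-wise maxima
// are written to out.
void verticalMax8(const int16_t* a, const int16_t* b, int16_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vm = _mm_max_epi16(va, vb);  // one instruction, 8 comparisons
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vm);
}
```

Because the eight comparisons are data-parallel and branch-free, this single instruction replaces eight compare-and-branch sequences.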

2.2 Fast MIP using MMX instruction

In this section, we explain the MIP implementation of Kye’s method [7], which can use MMX instructions thanks to a transformation of the coordinate system. Throughout, it is assumed that the coordinate axes of the volume data are denoted by the X, Y, and Z axes, and the coordinate axes of the output image by the i, j, and k axes. The volume data is therefore stored sequentially along the X-axis, and the image data along the i-axis, in physical memory. The viewing direction of the observer is the k-axis. Each voxel (pixel in medical images) has a 2-byte integer value, the standard voxel size of CT or MR images.

If the viewing direction (k-axis) is exactly parallel to the Z-axis, it is simple to implement MIP using MMX. First, the coordinate transformation is set up: since the Z-axis corresponds to the k-axis, the X-axis of the volume corresponds to the i-axis of the image, and the Y-axis of the volume to the j-axis of the image.

Then, four voxels lying sequentially along the X-axis correspond one-to-one to four pixels along the i-axis. If we read four consecutive voxels at once into an 8-byte MMX register, apply the max instruction against the four corresponding consecutive pixels of the image, and update them, the processing of the four voxels is completed at once without any branch instruction. Repeating this process along the Y-axis and Z-axis directions generates one output image.

Algorithm 1. MIP algorithm for viewing along the Z-axis using MMX.

figure d
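Since the listing itself appears in the figure above, the structure of Algorithm 1 can be sketched as follows. This is our own sketch using SSE2 intrinsics (8 voxels per register) rather than the paper’s MMX (4 voxels); it assumes nx is a multiple of 8 and non-negative voxel values, and the function name is ours:

```cpp
#include <cassert>
#include <emmintrin.h>
#include <cstdint>
#include <cstring>

// Sketch of Algorithm 1: MIP viewed along the Z-axis. The volume is stored
// X-fastest; each inner iteration updates 8 output pixels with one SIMD max,
// so the loop contains no branch on voxel values.
void mipAlongZ_sse2(const int16_t* volume, int nx, int ny, int nz,
                    int16_t* image) {
    std::memset(image, 0, sizeof(int16_t) * nx * ny);  // assumes values >= 0
    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y) {
            const int16_t* src = volume + (static_cast<size_t>(z) * ny + y) * nx;
            int16_t* dst = image + static_cast<size_t>(y) * nx;
            for (int x = 0; x < nx; x += 8) {
                __m128i v = _mm_loadu_si128((const __m128i*)(src + x));
                __m128i p = _mm_loadu_si128((const __m128i*)(dst + x));
                _mm_storeu_si128((__m128i*)(dst + x), _mm_max_epi16(v, p));
            }
        }
}
```

Both the volume reads and the image writes are sequential in memory, which is the first precondition of Section 2.1.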

Similarly, when we observe the volume data parallel to the Y-axis, the coordinate system is transformed as follows. Since the Y-axis corresponds to the k-axis, the viewing direction, the X-axis can correspond to the i-axis, and MIP can be efficiently performed as described above. The Z-axis then automatically corresponds to the j-axis.

However, observation along the X-axis causes inefficiency. Four consecutive values along the X-axis of the volume data correspond to the k-axis, and cannot be stored along the efficient i-axis. On the other hand, if we apply the ray casting method [23], parallelizing the maximum calculation along the X-axis is also inefficient, because this is the horizontal operation described in Section 2.1.

To reduce this inefficiency, related research [7] performed a matrix transposition on 4 × 4 voxel units (tiles). A tile spanning the X- and Y-axis directions is stored in four MMX registers, and the X-axis and Y-axis are exchanged by the transposition. As a result, the X-axis corresponds to the i-axis, and the SIMD instructions can be applied efficiently.

Since this transposition takes a considerable portion of the total computation time, further improvement is needed. In this research, we propose a new method to transpose a 16 × 16 tile using AVX2. Transposition of an 8 × 8 tile was studied in existing research [16], but transposition of a 16 × 16 tile is studied here for the first time.

3 The proposed method

Figure 5 shows the overall process of this research. Acceleration is performed using the CPU-based SIMD instructions (Section 3.1), and our coordinate system applies the shear-warp factorization (Section 3.2). In this process, different coordinate transformations are performed according to the observation direction (Section 3.3). When the observation direction is the X-axis, the fast transposition method is applied (Section 3.4).

Fig. 5
figure 5

A flowchart of the proposed MIP method

3.1 SIMD technologies of CPU and AVX2

In 1996, Intel announced MMX, the first SIMD technology for PCs, with 8-byte registers. Since the Pentium 3 was released in 1999, Streaming SIMD Extensions (SSE) has been available; SSE enables parallel execution of floating-point instructions on 16 bytes. Integer data processing became possible in 2001 when SSE2 was introduced. Advanced Vector Extensions (AVX), announced in 2008, can handle 32 bytes at once for floating-point instructions. AVX2, available from Intel’s Haswell CPUs released in 2013, can process integer data in parallel in units of 32 bytes.

In this research, the AVX2 instruction set is used to handle 2-byte integer voxels. Since AVX2 has been available on Intel CPUs since 2013, this research can be applied on most existing computers, without requiring additional equipment.

3.2 Shear-warp factorization

The user usually observes patient data from an oblique direction. In this research, shear-warp factorization [8, 9] is used for accelerated visualization. Shear-warp factorization divides the transformation of the coordinate system into two stages: shearing and warping. We denote by Mview the matrix that converts a point (Volx, Voly, Volz) in the volume coordinate system to a point (Imgx, Imgy, Imgz) in the image coordinate system.

$$ \begin{bmatrix} {Img}_x \\ {Img}_y \\ {Img}_z \\ 1 \end{bmatrix} = \begin{bmatrix} \; {M}_{view} \; \end{bmatrix} \begin{bmatrix} {Vol}_x \\ {Vol}_y \\ {Vol}_z \\ 1 \end{bmatrix} $$
(1)

Then, the shearing transformation is applied so that the viewing direction becomes parallel to the Z-axis. As a result, the volume data is skewed in the sheared coordinate system, as shown in Fig. 6. The matrix Mshear transforms each point in the volume coordinate system to the sheared coordinate system (see Eq. (2)). Each patient image constituting the volume data (each colored bar) can be projected onto the base plane. We calculate the position on the base plane for each image, and project the image in the same way as in Section 2.2.

$$ \begin{bmatrix} {Shear}_x \\ {Shear}_y \\ {Shear}_z \\ 1 \end{bmatrix} = \begin{bmatrix} \; {M}_{shear} \; \end{bmatrix} \begin{bmatrix} {Vol}_x \\ {Vol}_y \\ {Vol}_z \\ 1 \end{bmatrix} $$
(2)
Fig. 6
figure 6

Shearing transformation along the viewing direction

The projected image in the sheared coordinate system must then be transformed to the output coordinate system. We apply the two-dimensional image processing shown in Eq. (3) to generate the final output image. Since Mview and Mshear have been computed, we can calculate Mwarp as the product of Mview and Mshear −1, and obtain the final output image by applying the warping transformation to the base plane.

$$ \begin{aligned} \begin{bmatrix} {Img}_x \\ {Img}_y \\ {Img}_z \\ 1 \end{bmatrix} &= \begin{bmatrix} \; {M}_{view} \; \end{bmatrix} \begin{bmatrix} \; {M}_{shear} \; \end{bmatrix}^{-1} \begin{bmatrix} \; {M}_{shear} \; \end{bmatrix} \begin{bmatrix} {Vol}_x \\ {Vol}_y \\ {Vol}_z \\ 1 \end{bmatrix} \\ &= \begin{bmatrix} \; {M}_{view} \; \end{bmatrix} \begin{bmatrix} \; {M}_{shear} \; \end{bmatrix}^{-1} \begin{bmatrix} {Shear}_x \\ {Shear}_y \\ {Shear}_z \\ 1 \end{bmatrix} = \begin{bmatrix} \; {M}_{warp} \; \end{bmatrix} \begin{bmatrix} {Shear}_x \\ {Shear}_y \\ {Shear}_z \\ 1 \end{bmatrix} \end{aligned} $$
(3)

3.3 Rendering method for viewing directions

In this section, as in Section 2.2, it is assumed that the coordinate axes of the volume data are denoted by the X, Y, and Z axes, and the coordinate axes of the output image by the i, j, and k axes. Although the volume can be observed in the axial (Z-axis), coronal (Y-axis), or sagittal (X-axis) direction, it is efficient to make the X-axis correspond to the i-axis, because SIMD can handle data stored sequentially in memory.

When observing along the Y-axis, the Y-axis corresponds to the k-axis. Therefore, the X, Y, and Z axes correspond to the i, k, and j axes, respectively. Images parallel to the X-Z plane are projected onto the base plane while advancing by 1 along the Y-axis (see Fig. 7b).

Fig. 7
figure 7

Transformation of coordinate system and accessing order of volume data (red arrow). Observation in the a Z-axis, b Y-axis, c X-axis, and d X-axis after transposition

When observing along the X-axis, the X-axis corresponds to the k-axis and cannot correspond to the i-axis (see Fig. 7c). To solve this inefficiency, we transpose all images parallel to the X-Y plane. As a result, the X, Y, and Z axes can correspond to the i, k, and j axes, respectively (see Fig. 7d).

Since voxels along the X-axis and pixels along the i-axis correspond one-to-one, the coordinate calculation becomes simple, as shown in Eq. (4). The position in the sheared coordinate system for Volx = 0 is calculated once, and Volx is added to obtain the exact coordinates on the base plane. The coordinate calculation is thus completed with one addition, instead of a matrix multiplication.

$$ \begin{aligned} \begin{bmatrix} {Shear}_x \\ {Shear}_y \\ {Shear}_z \\ 1 \end{bmatrix} &= \begin{bmatrix} \; {M}_{shear} \; \end{bmatrix} \begin{bmatrix} {Vol}_x \\ {Vol}_y \\ {Vol}_z \\ 1 \end{bmatrix} \\ &= \begin{bmatrix} \; {M}_{shear} \; \end{bmatrix} \begin{bmatrix} 0 \\ {Vol}_y \\ {Vol}_z \\ 1 \end{bmatrix} + \begin{bmatrix} \; {M}_{shear} \; \end{bmatrix} \begin{bmatrix} {Vol}_x \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} {Shear}_x|_{{Vol}_x=0} \\ {Shear}_y \\ {Shear}_z \\ 1 \end{bmatrix} + \begin{bmatrix} {Vol}_x \\ 0 \\ 0 \\ 0 \end{bmatrix} \end{aligned} $$
(4)
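The saving in Eq. (4) can be sketched in plain C++. This is our own illustration with hypothetical names: apply() performs the full affine product, while incremental() reuses the once-per-scanline base position at Volx = 0:

```cpp
#include <cassert>
#include <array>
#include <cmath>

using Mat34 = std::array<std::array<double, 4>, 3>;  // affine 3x4 matrix

// Full matrix-vector product: one per voxel in the naive scheme.
std::array<double, 3> apply(const Mat34& m, double x, double y, double z) {
    std::array<double, 3> r;
    for (int i = 0; i < 3; ++i)
        r[i] = m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3];
    return r;
}

// Eq. (4): the sheared coordinate of (x, y, z) equals the coordinate of
// (0, y, z) plus x times the first column of M_shear. The base position is
// computed once per scanline; each voxel then needs only the update below.
std::array<double, 3> incremental(const Mat34& m, double x, double y, double z) {
    std::array<double, 3> base = apply(m, 0.0, y, z);  // once per scanline
    return { base[0] + x * m[0][0],
             base[1] + x * m[1][0],
             base[2] + x * m[2][0] };
}
```

When the first column of M_shear is (1, 0, 0)T, as in Eq. (4), the three multiplications in incremental() disappear and only a single addition to Shearx remains per voxel.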

We summarize this in Algorithms 2 and 3. When observing along the Y-axis, the calculation is performed in units of 32 bytes along the X-axis. When one scanline along the X-axis has been fully projected, processing proceeds along the Z-axis; when an X-Z plane has been fully projected, it proceeds along the Y-axis. When observing along the X-axis, an X-Y plane consists of several 16 × 16 tiles, and each tile is projected onto the base plane after transposition. When all tiles in an X-Y plane have been projected, processing proceeds along the Z-axis.

Algorithm 2. MIP algorithm for viewing along the Y-axis using AVX2.

figure e

Algorithm 3. MIP algorithm for viewing along the X-axis using AVX2.

figure f

3.4 Fast matrix transposition

The transposition is needed only when observing along the X-axis, so the observer perceives an unnatural slowdown in that direction. This problem can be solved by improving the efficiency of the transposition. In this section, we show a method that performs a 16 × 16 matrix transposition with 16 × log2 16 = 64 instructions, and we prove below that this is the minimum number of instructions.

For example, one 8-byte MMX register can store four voxels, so a 4 × 4 matrix is stored in four registers, as shown in Fig. 8. If the input values for the four registers are {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, and {13, 14, 15, 16}, respectively, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}, and {4, 8, 12, 16} must be obtained through transposition.

Fig. 8
figure 8

An example of 4 × 4 sized matrix transposition using SIMD [7]

The data exchanging operations of SIMD fill one output register with various combinations of two input registers. Because the values 1, 5, 9, and 13 are scattered across four different registers, it is impossible to obtain an output register with one instruction; at least two instructions are required. As shown in Fig. 8, the intermediate values {1, 5, 3, 7} and {9, 13, 11, 15} are generated first, and then {1, 5, 9, 13} is obtained. Although producing one output register takes three instructions in isolation, the intermediate results are reused, so the total is 8 instructions, i.e., two instructions per output register.

To transpose an 8 × 8 matrix, we need to gather values scattered across eight registers. Since one more instruction is required to combine two registers each gathered from four input registers, at least three instructions are needed per output register. Likewise, four instructions are necessary for a 16 × 16 matrix, and in general at least log2 n operations are required to gather values scattered across the n registers of an n × n matrix. Therefore, at least n log2 n instructions are necessary to transpose an n × n matrix.

Since n log2 n instructions are required for n2 elements, the overhead per element is \( \mathrm{ov}(n)=\frac{n{\log}_2n}{n^2}=\frac{{\log}_2n}{n} \). This function is monotonically decreasing in n, and \( \underset{n\to \infty }{\lim}\mathrm{ov}(n)=\underset{n\to \infty }{\lim}\frac{{\log}_2n}{n}=0 \). Therefore, as larger registers are used, n increases, the overhead decreases, and the relative overhead of transposing a matrix converges to zero.

In related work [7], 4 × 4 voxels were treated as a tile, and the transposition was implemented using the macro function provided by Intel. In this research, we developed a new method to transpose a 16 × 16 tile with 16 × log2 16 = 64 instructions using AVX2, which supports 32-byte registers.

First, we describe the transposition of an 8 × 8 tile using SSE2. One 16-byte SSE2 register can store 8 voxels, so 8 registers are used to store an 8 × 8 tile. For each 8 × 8 tile, eight instructions are used per phase, and the transposition is completed in three phases, i.e., 24 instructions [2]. Algorithm 4 shows an example of the transposition implemented in SSE2, while Fig. 9 shows how the data changes as each phase is executed. The UNPACK operation in Algorithm 4 is a data swizzle instruction [8] that combines parts of two registers into one register.

Fig. 9
figure 9

Input data a after phase1, b after phase2, c after phase3; and d the 8 × 8 transposition process using SSE2

Algorithm 4. 8 × 8 transposition.

figure g
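The three-phase 8 × 8 transposition of Algorithm 4 can be sketched with SSE2 intrinsics as follows. This is our own reconstruction of the scheme described in the text (eight UNPACK instructions per phase, 24 in total); the function name is ours:

```cpp
#include <cassert>
#include <emmintrin.h>
#include <cstdint>

// 8x8 transpose of 16-bit values in three unpack phases (8 instructions
// each, 24 in total), following the structure of Algorithm 4.
// in and out are row-major 8x8 arrays of 64 elements.
void transpose8x8(const int16_t* in, int16_t* out) {
    __m128i r[8], t[8], u[8], o[8];
    for (int i = 0; i < 8; ++i)
        r[i] = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 8 * i));
    // Phase 1: interleave 16-bit elements of adjacent row pairs.
    for (int i = 0; i < 4; ++i) {
        t[2 * i]     = _mm_unpacklo_epi16(r[2 * i], r[2 * i + 1]);
        t[2 * i + 1] = _mm_unpackhi_epi16(r[2 * i], r[2 * i + 1]);
    }
    // Phase 2: interleave 32-bit pairs.
    u[0] = _mm_unpacklo_epi32(t[0], t[2]); u[1] = _mm_unpackhi_epi32(t[0], t[2]);
    u[2] = _mm_unpacklo_epi32(t[1], t[3]); u[3] = _mm_unpackhi_epi32(t[1], t[3]);
    u[4] = _mm_unpacklo_epi32(t[4], t[6]); u[5] = _mm_unpackhi_epi32(t[4], t[6]);
    u[6] = _mm_unpacklo_epi32(t[5], t[7]); u[7] = _mm_unpackhi_epi32(t[5], t[7]);
    // Phase 3: interleave 64-bit halves to finish the transpose.
    o[0] = _mm_unpacklo_epi64(u[0], u[4]); o[1] = _mm_unpackhi_epi64(u[0], u[4]);
    o[2] = _mm_unpacklo_epi64(u[1], u[5]); o[3] = _mm_unpackhi_epi64(u[1], u[5]);
    o[4] = _mm_unpacklo_epi64(u[2], u[6]); o[5] = _mm_unpackhi_epi64(u[2], u[6]);
    o[6] = _mm_unpacklo_epi64(u[3], u[7]); o[7] = _mm_unpackhi_epi64(u[3], u[7]);
    for (int i = 0; i < 8; ++i)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 8 * i), o[i]);
}
```

Each phase doubles the length of the contiguous runs already in transposed order, which is why log2 8 = 3 phases of 8 instructions suffice.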

In this research, we extended this concept and developed a new transposition using AVX2 (for details, refer to the Appendix). A 32-byte AVX2 register can store 16 voxels, and 16 registers are used to transpose a 16 × 16 matrix. By executing four phases of instructions, the 16 × 16 matrix is transposed with 16 × 4 = 64 instructions.

If the existing MMX technology were used to transpose a 16 × 16 matrix, it would be necessary to transpose 4 × 4 matrices 16 times, requiring 8 instructions × 16 = 128 instructions. The proposed method uses only half of them (\( \frac{\mathrm{ov}(16)}{\mathrm{ov}(4)}=0.5 \)).

3.5 Warping

Figure 5 shows the final stage of this research: warping to generate the output image. Warping is an image processing technique corresponding to a two-dimensional affine transformation, which corrects the tilted image created in the previous step. The transformation is calculated as in Eq. (3).

Since forward mapping generates holes in the output, we used backward mapping with Mwarp −1 and bilinear interpolation. In modern hardware environments, two-dimensional warping takes less than 2 ms, so its influence on overall performance is very small.
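Backward mapping with bilinear interpolation can be sketched in plain C++. This is our own illustration with hypothetical names, not the paper’s implementation; inv holds the 2 × 3 affine matrix Mwarp−1 in row-major order:

```cpp
#include <cassert>
#include <vector>
#include <cmath>
#include <cstdint>

// Backward-mapping warp: for every output pixel, the inverse warp matrix
// (2x3 affine, row-major: [a b tx; c d ty]) gives a source position on the
// base plane, which is sampled with bilinear interpolation. Positions whose
// 2x2 neighborhood falls outside the base plane produce 0.
std::vector<uint16_t> warpBackward(const std::vector<uint16_t>& src,
                                   int sw, int sh,
                                   const double inv[6], int dw, int dh) {
    std::vector<uint16_t> dst(static_cast<size_t>(dw) * dh, 0);
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            double sx = inv[0] * x + inv[1] * y + inv[2];
            double sy = inv[3] * x + inv[4] * y + inv[5];
            int x0 = static_cast<int>(std::floor(sx));
            int y0 = static_cast<int>(std::floor(sy));
            if (x0 < 0 || y0 < 0 || x0 + 1 >= sw || y0 + 1 >= sh) continue;
            double fx = sx - x0, fy = sy - y0;
            double v00 = src[y0 * sw + x0],       v10 = src[y0 * sw + x0 + 1];
            double v01 = src[(y0 + 1) * sw + x0], v11 = src[(y0 + 1) * sw + x0 + 1];
            double v = (1 - fy) * ((1 - fx) * v00 + fx * v10)
                     + fy       * ((1 - fx) * v01 + fx * v11);
            dst[static_cast<size_t>(y) * dw + x] = static_cast<uint16_t>(v + 0.5);
        }
    return dst;
}
```

Iterating over output pixels guarantees that every destination pixel receives exactly one value, which is why backward mapping avoids the holes of forward mapping.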

4 Experimental results

4.1 Experimental environment and visual inspection

The experiments were performed on an Intel i5-4460 desktop system with a 3.2 GHz processor and 8 GB of main memory. We implemented our method using Visual Studio 2015 C++ on the Windows 7 operating system. Table 1 shows the four data sets on which the proposed method was tested. The number of images per scan ranged from 110 to 1165, each with a matrix size of 512 × 512; pixel sizes ranged from 0.42 to 0.78 mm. CT was performed with a 16-channel multi-detector row CT scanner (Siemens Somatom 16; Siemens, Forchheim, Germany) in all subjects. The scanning parameters were 120 kVp, 200 mA, 0.5-s rotation time, and a 512 × 512-pixel matrix with a detector-row configuration of 0.75 × 16.0 mm.

Table 1 Volume data set

We evaluated the MIP rendering speed at various viewing directions, since rendering performance varies with the viewing direction. We randomly generated 21 viewing direction vectors, as shown in Table 2: 7 randomly generated vectors around each of the principal viewing directions (x-, y-, and z-direction).

Table 2 Randomly generated viewing direction vectors

Figure 10 shows the MIP rendering results for the test data sets. In addition, we evaluated the image quality of our MIP rendering method by subtracting all MIP images of the proposed method from those of MMX [7] and SSE2 [8], while each data set was rotated from 0° to 360° in 1° increments around the diagonal axis (1, 1, 1). None of the subtraction images for the test data sets over the 360 viewing directions had any non-zero pixel. We conclude that the proposed method shows no loss of image quality.

Fig. 10
figure 10

MIP rendering images of test data sets. a Chest #1, b Chest #2, c Lung, d Head, e Lower extremity, and f Lower extremity with the other viewing direction

4.2 Comparison with the conventional methods

We compared the computational efficiency of the proposed method with those of the conventional methods MMX [7] and SSE2 [8]. Table 3 shows the average MIP rendering time for the three algorithms, measured over the 21 viewing directions (Table 2) using the Chest #1 data set. At each viewpoint, rendering was performed 100 times for accurate measurement. SSE2 [8] was faster than MMX [7], and the proposed method was faster than SSE2 [8]. Consequently, our method achieved about a 1.55-times speedup of MIP rendering on average, without any loss of image quality, compared with the conventional method [7].

Table 3 MIP rendering time comparison with conventional methods (ms = 10−3 s)

To analyze the performance according to the viewing direction, Fig. 11 shows the rendering time of each method for each viewing direction. Since our method and the conventional methods are based on shear-warp decomposition, the rendering time varies greatly with the viewing direction. Our method achieved about a 1.1-times speedup for the y-axis and z-axis viewing directions, since memory is accessed sequentially along these axes. For the x-axis viewing direction, our method achieved about a 2-times speedup compared with MMX [7]. MIP rendering for the x-axis viewing direction was slower than for the other viewing directions because of the matrix transposition; however, the fast transposition of our method reduced this penalty. In the previous method [7], rendering for the x-axis viewing direction was about 3.8 times slower than for the y-axis or z-axis; in our method, it was about 2.1 times slower. Our method thus substantially reduces the speed difference across viewing directions shown in Fig. 11. Furthermore, the proposed method does not require any time-consuming preprocessing or additional memory.

Fig. 11
figure 11

Comparison of MIP rendering time for each axis viewing direction, using MMX [7], SSE2 [8], and the proposed method

Next, we compared the processing time of the matrix transposition in our method with those of MMX [7] and Zekri et al. [26]. Table 4 shows the average processing time over one hundred million repetitions of the same computation for each method; this time does not include memory copy time. The experiments of Zekri et al. [26] in Table 4 were performed on an Intel i7-2670 desktop system with a 2.2 GHz processor, with float operations performed in 4 × 4 units. The processing speed of our method was about two times faster than that of Zekri et al. [26]; with respect to clock count, our method was about 1.5 times faster.

Table 4 Matrix transposition time comparison with the conventional methods (ns = 10−9 s)

In addition, we measured the MIP rendering time on various data sets to demonstrate the robustness of our method. Table 5 shows the average rendering time of each method over 100 repetitions for the x-axis direction (0.926509, 0.260581, 0.271438), the y-axis direction (−0.131742, 0.951469, −0.278122), and the z-axis direction (−0.102672, −0.667368, −0.737617). We found that the MIP rendering time increases linearly with the volume data size. For example, the head data set is two times larger than the Chest #1 data set, and its MIP rendering time was about two times longer. This is a characteristic of shear-warp based methods, which operate directly on the volume data, and it allows the MIP rendering time for another volume data set to be estimated approximately. Consequently, our method achieved about 20 fps for MIP rendering of the 512 × 512 × 552 head data set using a general CPU.

Table 5 MIP rendering time comparison of test data sets with the conventional methods (μs = 10−6 s)

5 Conclusion and future works

In this paper, we proposed a fast MIP algorithm using AVX that improves the method proposed by Kye [7]. Our method performs real-time MIP visualization of 512 × 512 × 552 medical image data at 20 fps, using a general single-core CPU without any preprocessing step. We developed a matrix transposition method that handles a 16 × 16 matrix in 64 operations. Experimental results showed that, compared with the previous method, the speed variation across viewing directions was reduced by more than a factor of two. Our matrix transposition method can also be applied to other image processing algorithms for faster processing.

In our method, we used a single-core CPU, and interpolation was not applied. If multi-core CPU techniques, such as OpenMP, are used, real-time MIP rendering might be possible with interpolation. In addition, since AVX-512, which handles 64 bytes, is planned for launch in late 2017, a 32 × 32 matrix transposition could be performed with 160 (32 × 5) operations. In future work, our algorithm can be extended to various applications on mobile platforms [13, 15], applied to real-world datasets [12, 14], and utilized in tracking and visual analysis applications [3, 10, 11].

(The executable file with a sample dataset is provided at the author’s web-site: http://www.gilab.co.kr).