1 Introduction

Magnetic resonance imaging (MRI) is a safe and efficient imaging modality. It causes no radiation damage to patients and can achieve tomography in any direction with high soft-tissue contrast. Therefore, MRI has been widely used clinically. However, conventional MRI has a long sampling time, which may lead to patient discomfort and to motion artifacts in the reconstructed images. Hence, the use of MRI may be limited in areas such as cardiac imaging and functional MRI.

One solution is to perform reconstruction from undersampled k-space data, which speeds up data acquisition by sampling fewer data. Compressed sensing (CS) theory [6, 7] shows that, under certain conditions, a signal that admits a sparse representation can be recovered from far fewer measurements than conventional sampling requires. Therefore, MRI images can be reconstructed from undersampled k-space data, and CS-based methods have been shown to produce high-quality reconstructions from undersampled measurements [9, 14, 24].
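In the MRI setting, the CS recovery problem is commonly posed in a form such as the following (a standard formulation, given here only for context; \(\varvec{\Psi }\) denotes a fixed sparsifying transform):

$$\min _{\mathbf {x}} \Vert \varvec{\Psi }\mathbf {x}\Vert _1 \quad \text {s.t. } \Vert \mathbf {F}_u\mathbf {x}-\mathbf {y}\Vert _2 \leqslant \varepsilon ,$$

where \(\mathbf {F}_u\) is the undersampling Fourier operator and \(\mathbf {y}\) the acquired k-space data. The dictionary learning approach adopted in this paper replaces the fixed transform \(\varvec{\Psi }\) with an adaptively trained dictionary, as formulated in Sect. 2.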

The dictionary learning method [8, 18, 21] is an extremely effective way to build adaptive dictionaries that yield good sparsity. The trained dictionary can then be used to compute sparse representations of signals and to reconstruct images. In our work, we use the k-singular value decomposition (K-SVD) algorithm [1] and the orthogonal matching pursuit (OMP) algorithm [5, 22, 23] to train dictionaries and obtain sparse representations.

Ying Song et al. [18] propose a new algorithm for MRI reconstruction from undersampled k-space data. They use three-dimensional dictionaries and reconstruct multi-slice MRI images. In their method, the three-dimensional data are divided into blocks, so the spatial correlation among slices can be exploited as prior knowledge when training dictionaries and updating the result images. Furthermore, they adopt a new dictionary learning scheme, dual-dictionary learning, with a low-resolution dictionary \(D^{\text {low}}\) for sparse coding and a high-resolution dictionary \(D^{\text {high}}\) for image updating. Their work indicates that the dual-dictionary scheme is better than a single-dictionary scheme: in the image updating stage, replacing \(D^{\text {low}}\) with \(D^{\text {high}}\) yields significantly improved outputs. Consequently, the number of loops is reduced and the procedure speeds up.

However, three-dimensional reconstruction must cope with a massive amount of data, which grows with the number of slices. In addition, K-SVD and OMP are both computationally expensive iterative algorithms, and the situation becomes even worse as the amount of data increases. Therefore, the reconstruction procedure needs to be accelerated. With the availability of inexpensive hardware and parallel development tools, graphics processing unit (GPU)-based applications are widely used and are developing rapidly. References [24, 16] present how hardware and software can work in concert on scalable multi-processor systems, with a number of illustrative examples and applications. In our work, we first design a parallel algorithm on the GPU directly under the compute unified device architecture (CUDA) scheme [12] (we refer to this version of the CUDA code as “original CUDA”). This design exploits the GPU’s strong computing power and high performance in parallel computing. Then, we carry out the algorithmic optimization proposed in [15], which serves as the basis for another version of the CUDA code (called CUDA after Algorithmic Optimization, i.e., CUDA-AO). In both versions of the CUDA code, we emphasize the optimization and parallelization of the K-SVD and OMP algorithms. We also implement parallel CPU codes using open multi-processing (OpenMP), with nearly the same parallelization mechanism as the CUDA codes, to further verify the efficiency of our CUDA codes.

GPUs have developed at a staggering pace in recent years. They offer strong advantages, including greater computing power than CPUs, lower cost for a comparable computing capability, and natural parallelism. These features have contributed to their rapid growth and wide use in scientific computing and engineering. With the constant development of technology, the demand for real-time, high-resolution 3D imaging techniques keeps growing, yet the deployment of advanced techniques is limited by processing time. GPU-based acceleration of MR image reconstruction has therefore received considerable research attention [10, 17, 19, 20]. In our work, we focus on a recent MRI reconstruction technique and design a new program for GPUs with the CUDA architecture. With this approach, we are able to achieve large acceleration ratios.

CUDA is a general-purpose parallel computing platform and programming model introduced by NVIDIA Corporation in November 2006. CUDA uses the parallel compute engines in NVIDIA GPUs to solve large, complex computational tasks. The model allows programmers to develop GPU programs easily, without detailed knowledge of the GPU’s internal structure or of how computation is parallelized across threads. In CUDA, the smallest execution unit is the thread. Many threads are grouped into a block, and numerous blocks together form a grid. Threads in the same block can access the same shared memory and can be synchronized. CUDA code consists of two parts: a host part executed on the CPU and a device part executed on the GPU. The program must manage data transfers between CPU memory and GPU memory, which carry a relatively high time cost when only small-scale data are processed. The powerful computing capability of CUDA has led to its increasing use in scientific computing. In our work, the CUDA runtime API and the CUDA Basic Linear Algebra Subroutines (CUBLAS) library [11], a GPU-accelerated version of the complete standard BLAS library, are used to develop the CUDA programs.
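As a minimal illustration of this host/device split and of the thread/block/grid hierarchy (a generic example, not taken from our reconstruction codes; all names are illustrative), the following CUDA C snippet copies data to the GPU, launches a kernel in which each thread processes one element, and copies the result back:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Device part: each thread scales one element of the array.
__global__ void scale_kernel(float* data, float factor, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (tid < n) data[tid] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // host -> device

    int threadsPerBlock = 256;                                     // threads per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;      // blocks in the grid
    scale_kernel<<<blocks, threadsPerBlock>>>(d, 2.0f, n);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // device -> host
    printf("h[0] = %f\n", h[0]);                                   // prints 2.000000

    cudaFree(d);
    delete[] h;
    return 0;
}
```

The same pattern, host-side allocation and transfers around a kernel launch, underlies the reconstruction kernels described in Sect. 3.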

The rest of this paper is organized as follows: Sect. 2 presents the algorithm scheme for three-dimensional MRI reconstruction. Section 3 introduces the parallel implementations on CUDA, including OMP, K-SVD, and the reconstruction procedure. Section 4 shows the performance of the original CUDA method and the CUDA-AO method. Section 5 draws the conclusions.

2 The reconstruction algorithm scheme

MRI datasets are divided into small blocks for three-dimensional reconstruction. Then, data are rearranged for further processing. The formulation of the multi-slice MRI reconstruction [18] is shown in (1).

$$\begin{aligned}&\min _{\mathbf {x}, \varvec{\alpha }_{i, j, t}}\sum _{i, j, t} \Vert \mathbf {R}_{i, j, t}\mathbf {x}-\mathbf {D}\varvec{\alpha }_{i, j, t} \Vert ^2_2 + \nu \Vert \mathbf {F}_u\mathbf {x}-\mathbf {y} \Vert ^2_2 \nonumber \\&\quad \text {s.t. } \Vert \varvec{\alpha }_{i, j, t} \Vert _0 \leqslant \rho \quad \forall i, j, t \end{aligned}$$
(1)

where \(\mathbf {x}\) is the unknown image to be reconstructed; \(\mathbf {R}_{i, j, t}\) is the block extraction operator, indexed by the location of the lowest top-left point \((i, j, t)\) of the image block, as shown in Fig. 1; \(\varvec{\alpha }_{i, j, t}\) is the sparse representation of the block \(\mathbf {R}_{i, j, t}\mathbf {x}\) under the dictionary \(\mathbf {D}\); \(\left\| \cdot \right\| _0\) is the \(l_0\)-norm, which counts the number of nonzero elements; \(\mathbf {F}_u\) is the undersampling Fourier matrix; \(\mathbf {y}\) denotes the undersampled k-space measurements; \(\rho \) is the sparsity level; and \(\nu \) is defined by \(\nu =\lambda / \sigma \), where \(\lambda \) is a positive constant and \(\sigma \) is the standard deviation of the noise.

Fig. 1

a Three-dimensional MRI series volume consisting of multiple slices. b The multi-slice MR images are divided into blocks; the lowest top-left point of one block is shown

The first term in (1) constrains the quality of the sparse approximations of the image blocks with respect to the dictionary \(\mathbf {D}\). The second term enforces data fidelity of the reconstruction. The reconstruction scheme consists of the following three steps:

1. Dictionary learning step: the K-SVD algorithm is used to train the dual dictionaries \(D^{\text {low}}\) and \(D^{\text {high}}\).

2. Sparse coding step: the image \(\mathbf {x}\) is assumed to be fixed, and the dictionary \(D^{\text {low}}\) is used to obtain the sparse representation \(\varvec{\alpha }\).

3. Image updating step: \(D^{\text {low}}\) is replaced by \(D^{\text {high}}\), and \(\varvec{\alpha }\) and \(D^{\text {high}}\) are used to reconstruct and update the final images.

At the dictionary learning step, the OMP algorithm within K-SVD [1] is used to compute the representation vectors for each column of the signal \(\mathbf {x}\). At the sparse coding step, OMP is again used to obtain the sparse representations with respect to the low-resolution dictionary [18]. The time spent in OMP accounts for the largest proportion of the total time consumed by the reconstruction procedure. The K-SVD algorithm is efficient at training adaptive dictionaries [1], and we use it to train both the high-resolution and the low-resolution dictionaries in our reconstruction scheme. However, K-SVD is an extremely computationally expensive iterative method: as shown later in this paper (see Table 2), its execution time is typically more than 91 % of the total reconstruction time in the original CPU codes. OMP and K-SVD are thus the most time-consuming parts. Hence, we focus on these parts and develop parallel algorithms and codes on CUDA to accelerate the whole reconstruction process; a high-level sketch of the three-step scheme is given below.
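For orientation, the following host-side sketch shows how the three steps fit together. The types, function names, and parameters (Dictionary, train_dictionary_ksvd, sparse_code_omp, update_image, the iteration count) are placeholders for the CUDA routines described in Sect. 3, not the actual interfaces of our codes.

```cpp
// Hedged sketch of the dual-dictionary reconstruction loop (placeholder types/names).
struct Dictionary {};    // trained dictionary (D^low or D^high)
struct SparseCodes {};   // sparse representation alpha
struct Volume {};        // multi-slice image / k-space data

// Stubs standing in for the CUDA routines of Sects. 3.1-3.3.
Dictionary  train_dictionary_ksvd(const Volume&)               { return {}; }  // Sect. 3.2
SparseCodes sparse_code_omp(const Volume&, const Dictionary&)  { return {}; }  // Sect. 3.1
Volume      update_image(const SparseCodes&, const Dictionary&,
                         const Volume& /* undersampled y */)   { return {}; }  // Sect. 3.3

Volume reconstruct(const Volume& x0, const Volume& y_kspace, int nIter) {
    // Step 1: dictionary learning (dual dictionaries trained with K-SVD;
    // training data and parameters are omitted in this sketch)
    Dictionary Dlow  = train_dictionary_ksvd(x0);
    Dictionary Dhigh = train_dictionary_ksvd(x0);

    Volume x = x0;
    for (int it = 0; it < nIter; ++it) {   // only a few outer iterations are needed
        // Step 2: sparse coding with the low-resolution dictionary
        SparseCodes a = sparse_code_omp(x, Dlow);
        // Step 3: image updating with the high-resolution dictionary and the k-space data
        x = update_image(a, Dhigh, y_kspace);
    }
    return x;
}

int main() { Volume x0, y; reconstruct(x0, y, 5); return 0; }
```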

3 Parallel implementations on CUDA

3.1 CUDA implementation of OMP

OMP is an attractive sparse signal recovery algorithm that achieves good performance and is easy to implement. OMP is a greedy algorithm: at every step, it chooses from the dictionary the atom closest to the current residual signal, i.e., the atom with the largest inner product with the residual. The signal is then orthogonally projected onto the span of the selected atoms for approximation. The OMP algorithm contains numerous matrix/vector operations, which are convenient to implement in a parallel manner on the GPU. The signal \(\mathbf {x}\) is a matrix whose columns correspond to the blocks shown in Fig. 1: the data in one block are rearranged into a column vector and stored in one column of \(\mathbf {x}\). Each column of \(\mathbf {x}\) requires one loop cycle.

The OMP algorithm is iterative. In one iteration, one column of the signal \(\mathbf {x}\) is processed to obtain the corresponding sparse representation. In the original CUDA method, we simply parallelize the code inside the OMP iteration and develop parallel implementations of the matrix/vector operations to accelerate the code. The detailed CUDA version of the OMP algorithm is shown in Algorithm 1. Notably, “\(\mathbf {0}\)” represents the empty set. For an index set \(\mathbf {I}=(i_1, i_2, \ldots , i_n)\), \(\varvec{\alpha }_{\mathbf {I}}\) denotes the sub-matrix of \(\varvec{\alpha }\) consisting of the columns indexed by \(\mathbf {I}\), in the order \(i_1, i_2, \ldots , i_n\).

One important part of OMP is the calculation of the pseudo-inverse of matrices, which is time consuming when the matrices are large. Rubinstein et al. [15] introduce a method that uses Cholesky factorization to avoid computing pseudo-inverses in the OMP algorithm. We use this method to first optimize the codes algorithmically and then port them to CUDA. The modified OMP algorithm is more suitable for parallel implementation on CUDA.
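To make the per-column work concrete, the following sequential C++ sketch runs plain OMP for a single signal column, using a naive normal-equations solve for the orthogonal projection. It only illustrates the computation that Algorithm 1 assigns to each CUDA thread; the Cholesky-based variant of [15] replaces the small solve below with an incremental factorization, and all names here are illustrative.

```cpp
#include <vector>
#include <cmath>

// Solve the small dense system A c = b (s x s) by Gaussian elimination with
// partial pivoting; adequate for the tiny systems that arise inside OMP.
static void solve_small(std::vector<double> A, std::vector<double> b,
                        std::vector<double>& c, int s) {
    for (int k = 0; k < s; ++k) {
        int p = k;
        for (int i = k + 1; i < s; ++i)
            if (std::fabs(A[i * s + k]) > std::fabs(A[p * s + k])) p = i;
        for (int j = 0; j < s; ++j) std::swap(A[k * s + j], A[p * s + j]);
        std::swap(b[k], b[p]);
        for (int i = k + 1; i < s; ++i) {
            double f = A[i * s + k] / A[k * s + k];
            for (int j = k; j < s; ++j) A[i * s + j] -= f * A[k * s + j];
            b[i] -= f * b[k];
        }
    }
    c.assign(s, 0.0);
    for (int i = s - 1; i >= 0; --i) {
        double t = b[i];
        for (int j = i + 1; j < s; ++j) t -= A[i * s + j] * c[j];
        c[i] = t / A[i * s + i];
    }
}

// OMP for one column: D is n x K (column-major, atoms assumed normalized),
// x has length n, rho is the sparsity level. Outputs the selected atom indices
// ('support') and their coefficients ('coef').
void omp_single_column(const std::vector<double>& D, const std::vector<double>& x,
                       int n, int K, int rho,
                       std::vector<int>& support, std::vector<double>& coef) {
    std::vector<double> r = x;                       // residual, initially the signal
    support.clear();
    for (int s = 0; s < rho; ++s) {
        // 1. pick the atom with the largest |<d_k, r>|
        int best = 0; double bestVal = -1.0;
        for (int k = 0; k < K; ++k) {
            double ip = 0.0;
            for (int i = 0; i < n; ++i) ip += D[k * n + i] * r[i];
            if (std::fabs(ip) > bestVal) { bestVal = std::fabs(ip); best = k; }
        }
        support.push_back(best);
        // 2. orthogonal projection: solve (D_I^T D_I) c = D_I^T x
        int m = static_cast<int>(support.size());
        std::vector<double> G(m * m), rhs(m);
        for (int a = 0; a < m; ++a) {
            for (int b2 = 0; b2 < m; ++b2) {
                double v = 0.0;
                for (int i = 0; i < n; ++i)
                    v += D[support[a] * n + i] * D[support[b2] * n + i];
                G[a * m + b2] = v;
            }
            double v = 0.0;
            for (int i = 0; i < n; ++i) v += D[support[a] * n + i] * x[i];
            rhs[a] = v;
        }
        solve_small(G, rhs, coef, m);
        // 3. update the residual r = x - D_I c
        r = x;
        for (int a = 0; a < m; ++a)
            for (int i = 0; i < n; ++i)
                r[i] -= coef[a] * D[support[a] * n + i];
    }
}
```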

Algorithm 1 CUDA version of the OMP algorithm (OMP_kernel)

For three-dimensional MRI reconstruction, the data must be divided into many small blocks. For instance, with \(4\times 4\times 4\) blocks, all blocks are rearranged together, each block contributing one column to \(\mathbf {x}\). Thus, the number of columns of \(\mathbf {x}\) can become extremely large as the number of MRI slices grows. In the original CPU version of the OMP algorithm, each column of \(\mathbf {x}\) requires one loop iteration, so a large number of loops are needed and much time is consumed. In CUDA, we can assign one thread to execute the operations inside one loop, so numerous threads work concurrently and compute the results simultaneously. The number of threads allocated for computation equals the number of columns of \(\mathbf {x}\): we arrange tens of thousands of threads to execute the OMP_kernel procedure in Algorithm 1. The outer iteration over columns is thereby eliminated, and the operations for all columns of \(\mathbf {x}\) run simultaneously. The iteration within the processing of each column, which runs until the sparsity level is reached, still exists; it ensures convergence for each signal component.

In the CUDA version of the OMP algorithm, numerous intermediate variables must be allocated and initialized in GPU memory at the beginning. Because numerous threads run simultaneously, every thread needs its own copy of the intermediate variables to avoid data conflicts among threads. Hence, we must enlarge the size of every intermediate variable according to the number of threads running at one time. For instance, if a single thread requires an intermediate variable of size 1,000 in C++ float type and 2,000 threads run simultaneously, then this intermediate variable should be allocated a memory space of 1,000 \(\times \) 2,000 at the initialization stage. Each thread then accesses its own region of memory, indexed by its tid value, by adding the corresponding offset to the address pointer of each intermediate variable, as shown in Algorithm 1, line 2. Furthermore, intermediate variables that remain unchanged during the execution of OMP can be stored in shared memory to further speed up the procedure. All functions called inside the OMP_kernel procedure must be modified appropriately and declared with the __device__ specifier, which, like __global__, is part of the C extensions of CUDA [12].
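The thread-per-column layout and the per-thread workspace offsetting can be sketched as follows. All names and sizes (WORK_PER_THREAD, omp_one_column, the launch configuration) are illustrative, and the device body is only a stub standing in for the per-column OMP computation of Algorithm 1.

```cuda
#include <cuda_runtime.h>

#define WORK_PER_THREAD 1000   // scratch floats needed by a single thread (example value)

// Per-column OMP body (stub): the real code performs the greedy atom selection and
// projection steps of Algorithm 1 here, using 'work' as private scratch space.
__device__ void omp_one_column(const float* D, const float* xcol, float* alphacol,
                               float* work, int n, int K, int rho) {
    for (int k = 0; k < K; ++k) alphacol[k] = 0.0f;   // placeholder body
    (void)D; (void)xcol; (void)work; (void)n; (void)rho;
}

__global__ void OMP_kernel(const float* D, const float* x, float* alpha,
                           float* work, int n, int K, int rho, int numCols) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column of x
    if (tid >= numCols) return;
    // Offset every intermediate buffer by tid so threads never share scratch space
    // (cf. Algorithm 1, line 2).
    float*       myWork  = work  + (size_t)tid * WORK_PER_THREAD;
    const float* myCol   = x     + (size_t)tid * n;
    float*       myAlpha = alpha + (size_t)tid * K;
    omp_one_column(D, myCol, myAlpha, myWork, n, K, rho);
}

// Host side: allocate the enlarged workspace (per-thread size times thread count)
// and launch as many threads as there are columns of x.
void run_omp(const float* dD, const float* dX, float* dAlpha,
             int n, int K, int rho, int numCols) {
    float* dWork = nullptr;
    cudaMalloc(&dWork, (size_t)numCols * WORK_PER_THREAD * sizeof(float));
    int threads = 256;
    int blocks  = (numCols + threads - 1) / threads;
    OMP_kernel<<<blocks, threads>>>(dD, dX, dAlpha, dWork, n, K, rho, numCols);
    cudaDeviceSynchronize();
    cudaFree(dWork);
}
```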

3.2 CUDA implementation of K-SVD

The K-SVD algorithm is used to train both the low-resolution dictionary and the high-resolution dictionary. K-SVD updates only one column of the dictionary at each iteration and involves only the signals that use the current atom. The details of K-SVD are described in Reference [1]. K-SVD consists of two steps: a sparse coding stage, which uses the OMP algorithm to compute the sparse representation vectors for each column of \(\mathbf {x}\) with a fixed dictionary \(\mathbf {D}\); and a codebook update stage, which analyzes the sparse representation \(\varvec{\alpha }\) obtained in the sparse coding stage, computes the representation error matrix, and applies a singular value decomposition (SVD) to update the dictionary atom and the corresponding row of \(\varvec{\alpha }\).
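For reference, the atom update in the codebook update stage can be stated in the standard K-SVD form (written here in our notation for clarity). Let \(\varvec{\alpha }^j\) denote the \(j\)th row of \(\varvec{\alpha }\) and \(I_k\) the set of signals (columns of \(\mathbf {x}\)) whose representation uses \(\mathbf {d}_k\); then

$$\mathbf {E}_k = \mathbf {x}_{I_k} - \sum _{j \ne k} \mathbf {d}_j \varvec{\alpha }^j_{I_k}, \qquad (\mathbf {d}_k, \varvec{\alpha }^k_{I_k}) = \mathop {\arg \min }\limits _{\mathbf {d},\, \mathbf {g}} \Vert \mathbf {E}_k - \mathbf {d}\mathbf {g} \Vert ^2_F,$$

where the rank-1 minimizer is obtained from the largest singular value of \(\mathbf {E}_k\) and its associated singular vectors.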

The CUDA implementation of K-SVD is shown in Fig. 2. We adopt the CUDA version of the OMP method in Algorithm 1 to parallelize the sparse coding stage. The codebook update stage requires one loop iteration per column of \(\mathbf {D}\). In each loop, the current dictionary atom and the corresponding sparse representation vector are updated, and the updated \(\mathbf {D}\) and \(\varvec{\alpha }\) are used in the next loop. Hence, each iteration depends on the results of the previous one, and the iteration cannot be eliminated by simply assigning many CUDA threads to run simultaneously. However, we can still accelerate the K-SVD procedure by parallelizing the code inside the loops.

Fig. 2

K-SVD algorithm implemented on CUDA

First, we analyze the data in \(\varvec{\alpha }\) to obtain the indices of the items that use the current dictionary atom. We denote the \(k\)th column of \(\mathbf {D}\) by \(\mathbf {d}_k\) and let \(\mathbf {d}_k\) be the dictionary atom currently being trained. We then obtain the indices of the items in the \(k\)th row of \(\varvec{\alpha }\) that use \(\mathbf {d}_k\), i.e., whose values are nonzero. In CUDA, we can deploy as many threads as there are items in \(\varvec{\alpha }\) to traverse the data and obtain the results simultaneously. This procedure can even be carried out before the iteration if we keep track of the row indices of \(\varvec{\alpha }\), as shown in Fig. 2. Second, the computation of the representation error matrix consists merely of matrix/vector operations, so parallelizing it on the GPU is straightforward. In addition, we use CULA [13], an implementation of the linear algebra package (LAPACK) interface for CUDA-enabled GPUs, to perform the singular value decomposition of the error matrix, and we employ the CUBLAS library to accelerate the execution of several linear algebra operations.
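A minimal sketch of this parallel index search is given below; the kernel name, the column-major storage layout, and the use of an atomic counter are our illustrative choices, not necessarily those of the actual codes. One thread inspects one column of \(\varvec{\alpha }\) and, if the entry in row k is nonzero, appends the column index to a compact list.

```cuda
#include <cuda_runtime.h>

// Collect the indices of the signals (columns of alpha) that use atom k, i.e., whose
// entry in row k is nonzero. alpha is assumed column-major with K rows and numCols
// columns; 'count' must be zeroed before the launch.
__global__ void find_users_of_atom(const float* alpha, int K, int numCols,
                                   int k, int* indices, int* count) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column of alpha
    if (j >= numCols) return;
    if (alpha[(size_t)j * K + k] != 0.0f) {
        int pos = atomicAdd(count, 1);   // reserve a slot in the output list
        indices[pos] = j;                // note: the resulting order is not sorted
    }
}
```

The index list stays in GPU memory and is consumed directly by the subsequent error-matrix computation, so no transfer back to the host is required.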

The SVD operation in the codebook update stage demands a considerable amount of computation and is regarded as time consuming. Rubinstein et al. [15] develop an approximate K-SVD implementation that avoids the SVD operation. We use their method to optimize K-SVD algorithmically and then develop the CUDA version of K-SVD, which is part of the CUDA-AO method.
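In essence (our paraphrase of [15], with the notation adapted to ours and \(\mathbf {d}_k\) first set to zero), the exact rank-1 SVD is replaced by a single alternating least-squares step over the signals \(I_k\) that use the atom:

$$\mathbf {g} := (\varvec{\alpha }^k_{I_k})^{\mathrm {T}}, \quad \mathbf {d} := \mathbf {x}_{I_k}\mathbf {g} - \mathbf {D}\varvec{\alpha }_{I_k}\mathbf {g}, \quad \mathbf {d}_k := \mathbf {d}/\Vert \mathbf {d}\Vert _2, \quad \varvec{\alpha }^k_{I_k} := \mathbf {d}_k^{\mathrm {T}}\left( \mathbf {x}_{I_k} - \mathbf {D}\varvec{\alpha }_{I_k}\right) ,$$

which involves only matrix–vector products and therefore maps well onto CUBLAS calls.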

When implementing the K-SVD algorithm on CUDA, numerous intermediate results can be kept in GPU memory and reused by subsequent GPU computations. For instance, after obtaining the indices of the items in \(\varvec{\alpha }\) that use the current dictionary atom, we can use these data directly for the error matrix computation: both operations execute on the GPU and the data stay in GPU memory. Hence, we do not need to transfer data between GPU memory and CPU memory, which would take a relatively long time and degrade the performance of our codes.

3.3 Image reconstruction using CUDA

As mentioned in Sect. 1, both the low-resolution dictionary and the high-resolution dictionary are trained by the K-SVD algorithm implemented on CUDA. To update the final results, we employ the method proposed in [18] and rewrite the codes using CUDA. First, we use the low-resolution dictionary \(D^{\text {low}}\) to obtain the sparse representation \(\varvec{\alpha }\) of the MRI datasets to be reconstructed. Then, we reconstruct and update the results using \(D^{\text {high}}\) and \(\varvec{\alpha }\). The detailed reconstruction procedure is shown in Fig. 3.

Fig. 3

Image reconstruction using CUDA

The reconstruction process requires a few loops to obtain better results, and the result of one loop iteration is further refined by the following loops. So, once again, the iteration cannot be eliminated by the CUDA thread parallelization technique. Fortunately, only a few iterations are needed [18], fewer than 10.

At the sparse coding step, the CUDA version of the OMP algorithm described in Sect. 3.1 is used. We can divide the dataset into many parts to obtain a better parallelization effect: every part is processed simultaneously, with every CUDA thread running an OMP_kernel procedure, and the results of all threads are combined to form the final result. If the GPU memory is insufficient, a compromise is to execute the process in a few loops. For example, if 150 items must be processed but the GPU memory can hold no more than 100, clearly not all data can be processed at the same time; we therefore divide the 150 items into two parts, process 75 at a time, and finish the task in two loops. In our implementation, the GPU memory is 3,072 megabytes (MB). At the sparse coding step, the dictionary size is set to 600. A \(256\,{\times }\,256\,{\times }\,28\) MR dataset, rearranged into \(4\,{\times }\,4\,{\times }\,4\) blocks, yields a new matrix of size 64 \(\times \) 1,835,008, and the sparse representation coefficient matrix has size 600 \(\times \) 1,835,008, which requires more than 4,400 MB of memory for float-type data. The requirement is even larger when other intermediate variables are considered, so the GPU memory clearly cannot meet this demand. The problem is overcome by executing the process in a few loops: if the number of loops is set to 4, only 1,835,008/4 = 458,752 columns (requiring approximately 1,100 MB of memory) are processed in each loop, and the memory requirement is reduced to a quarter of the original. In this case, the GPU memory can handle the requirement.
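The chunking logic can be sketched as follows, using the sizes from the example above; the function and variable names are illustrative, and sparse_code_chunk stands in for one CUDA sparse coding pass.

```cpp
#include <cstdio>
#include <cstddef>

// Stub standing in for the CUDA sparse coding pass over one chunk of columns.
void sparse_code_chunk(size_t firstCol, size_t numCols) {
    printf("processing columns %llu .. %llu\n",
           (unsigned long long)firstCol, (unsigned long long)(firstCol + numCols - 1));
}

int main() {
    const size_t numCols  = 1835008;   // 256 x 256 x 28 voxels -> 4x4x4 blocks
    const size_t dictSize = 600;       // atoms in D^low
    const int    numLoops = 4;         // chosen so that each chunk fits in GPU memory

    // Full coefficient matrix: 600 x 1,835,008 floats (> 4,400 MB)
    double fullMB  = (double)dictSize * numCols * sizeof(float) / 1e6;
    // Per-chunk requirement after splitting into numLoops passes (~1,100 MB)
    double chunkMB = fullMB / numLoops;
    printf("alpha: %.0f MB total, %.0f MB per chunk\n", fullMB, chunkMB);

    size_t chunk = (numCols + numLoops - 1) / numLoops;   // 458,752 columns per pass
    for (size_t start = 0; start < numCols; start += chunk)
        sparse_code_chunk(start, (start + chunk <= numCols) ? chunk : numCols - start);
    return 0;
}
```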

4 Results

We use CUDA 5.5 to develop our programs on the GPU. All computations are performed on a 64-bit computer with Intel Xeon E5640 2.67 GHz CPUs and an NVIDIA GeForce GTX 780 Ti GPU under the Windows Server 2008 R2 operating system. The experimental computer has two Xeon E5640 2.67 GHz CPUs, giving 8 cores and 16 threads in total. All codes are written in C++ and CUDA C within the Microsoft Visual Studio 2010 integrated development environment (IDE). The host PC has 48 gigabytes (GB) of memory, whereas the GPU memory is 3,072 MB. We test the bandwidth of the GPU using the sample utility “bandwidthTest” in the NVIDIA GPU Computing Toolkit; the host-to-device bandwidth is 4.25 GB/s, the device-to-host bandwidth is 5.59 GB/s, and the device-to-device bandwidth is 224.34 GB/s, as shown in Fig. 4. All codes are compiled in 64-bit release mode within Microsoft Visual Studio 2010, with compiler options set to maximize speed, enable intrinsic functions, and enable function-level linking. We mainly use the NVIDIA Visual Profiler to record the running time of the codes; timing operations on the host are also used.

Fig. 4

Bandwidth of NVIDIA GeForce GTX 780 Ti

We test the acceleration results with different numbers of slices in the MRI datasets. Table 1 shows the time consumed by the OMP algorithm at the sparse coding step and its speedup (Fig. 5a gives a more intuitive view). When implemented on CUDA, numerous threads work simultaneously and the consumed time is reduced. The amount of data, and hence the time consumed by the CPU codes, increases rapidly as the number of MRI slices grows: the time consumed by OMP increases from 38.97 s to 1,299.99 s when the number of slices changes from 4 to 28. In contrast, the original CUDA method needs 221.69 s and the CUDA-AO method only 47.07 s with 28 MRI slices, which indicates that CUDA can greatly accelerate the procedure. The CUDA-AO method avoids calculating the pseudo-inverse of matrices and is more suitable for parallelization; hence, it achieves a better acceleration effect. However, the speedup obtained for OMP is not as large as that for K-SVD (Table 2), because the iteration for reaching the sparsity level still exists to ensure convergence for each signal component, which increases the time overhead.

Fig. 5

Speedup with different numbers of MRI slices versus original CPU codes: a OMP algorithm at sparse coding step; b K-SVD algorithm and c total reconstruction procedure

Table 1 Execution time and speedup of OMP algorithm at sparse coding step
Table 2 Execution time and speedup of K-SVD algorithm

Table 2 shows the execution time of the K-SVD algorithm in the original CUDA method and the CUDA-AO method. The results indicate great acceleration. After algorithmic optimization, the SVD is avoided and the method is more suitable for parallelization using CUDA; hence, the CUDA-AO method achieves a much better acceleration effect than the original CUDA method. When the number of slices is 24, the original CUDA method obtains approximately 305 times speedup, while approximately 1,912 times speedup is achieved with the algorithmic optimization. As shown in Fig. 5b, the speedup increases significantly with the number of MRI slices. The speedup is mostly obtained in the codebook update stage shown in Fig. 2: we deploy many CUDA threads to traverse the data and obtain the results simultaneously, and many intermediate variables can be stored in GPU memory for further use without being transferred to CPU memory. This design dramatically reduces the time consumed.

Table 3 shows the execution time of the original CPU codes, the original CUDA method, and the CUDA-AO method. As the number of MRI slices increases, the speedup becomes larger, as shown in Fig. 5c. The reconstruction procedure obtains more than 20 times speedup using the original CUDA method. The acceleration effect is even better with the CUDA-AO method, which attains approximately 324 times speedup when the number of slices is 24. The results indicate that both the original CUDA method and the CUDA-AO method achieve good speedup, and the CUDA-AO method is better. Hence, optimizing the method algorithmically makes it more suitable for parallelization on the GPU.

Table 3 Execution time and speedup obtained with original CUDA method and CUDA-AO method

The CUDA-AO method exhibits clear advantages over the OpenMP codes, particularly when the MR data volume is large, as shown in Fig. 6. Although the OpenMP codes use nearly the same parallelization mechanism as CUDA-AO, their performance is limited by the computing capability of the CPU and is therefore much lower than that of the CUDA-AO method. When the number of MR slices is 28, more than 300 times speedup is obtained by the CUDA-AO method over OpenMP for the K-SVD algorithm, and the total reconstruction time of CUDA-AO is approximately 1/48 of that of the OpenMP codes.

Fig. 6

Speedup with different numbers of MRI slices versus OpenMP codes: a OMP algorithm at sparse coding step; b K-SVD algorithm and c total reconstruction procedure

For the CUDA-AO method, when the number of MRI slices is small, for instance 4 or 8, the speedup curve is low and the acceleration effect is relatively modest, as shown in Fig. 5. With 4 or 8 MRI slices, the amount of data is not overly large and the CPU has strong computing power, so the performance gap between CPU and GPU is not substantial. However, when the amount of data increases to a certain degree, for instance 16 MRI slices, the speedup curve becomes rather high and the acceleration effect becomes very obvious. These findings indicate that CUDA is better suited to massive data and heavy computation. If the number of slices is increased further, the acceleration effect becomes limited by the performance of the graphics card and the growth of the speedup curve becomes insignificant. The data transfer overheads between GPU and CPU can easily be recorded using the NVIDIA Visual Profiler. The GPU–CPU transfer time for the CUDA-AO method is shown in Table 4: the time consumed by GPU–CPU data transfer is relatively short and gradually increases as the MRI data volume increases. Note that with less data, the GPU–CPU data transfer time accounts for a relatively large share of the total reconstruction time; this share decreases as the data volume grows.

Table 4 Data transfer time (seconds) between GPU and CPU for CUDA-AO method

5 Conclusion

In this work, we accelerate MRI reconstruction by three-dimensional dual-dictionary learning using CUDA. The parallel algorithms on the GPU and their acceleration performance are investigated. We develop two versions of CUDA code: (1) the original CUDA method, which directly ports the original CPU codes to CUDA; and (2) the CUDA-AO method, which first improves the original CPU codes with algorithmic optimization and then implements the codes on CUDA. Parallel CPU codes are also developed using OpenMP. Experiments show that approximately 94 times speedup is achieved with the original CUDA method when the number of slices is 24, while roughly 324 times speedup is obtained with the CUDA-AO method. Compared with OpenMP, the CUDA-AO method acquires more than 40 times speedup.

Our methods achieve great success for the K-SVD algorithm. This indicates that for some iterative algorithms, even when the iteration cannot be eliminated, acceleration can still be obtained by parallelizing the operations inside the loops. For the OMP algorithm, plenty of CUDA threads run simultaneously, with each thread executing a single OMP procedure individually. Our CUDA implementations dramatically reduce the time consumed by the total reconstruction procedure.

This work shows that CUDA and algorithmic optimization offer great advantages in accelerating MRI reconstruction and provide an effective way to bring long, time-consuming algorithms into practical and clinical use.