Introduction

Digital volume correlation (DVC) has been adopted as an effective technique for determining internal volumetric deformation within solid materials since it was proposed by Bay and Smith [1, 2]. The DVC technique tracks small motion of points of interest (POIs) by minimizing the difference between the POI-centric subvolumes in the reference (un-deformed) volume image and the target (deformed) volume image acquired using 3D imaging techniques such as X-ray computer tomography (CT). This technique can be considered as a straightforward extension of the well-established digital image correlation (DIC) [3, 4] and shares its simplicity in principles and effectiveness in applications. Nowadays, it has been extensively applied in the characterization of various materials including bones [1, 5], soft materials [6, 7], wood [8] and sand [9]. Compared with DIC, the computational burden of DVC is much heavier. Thus high computation speed of DVC has become one of main issues in this area that challenges researchers in the past decade.

A DVC algorithm can be performed at two levels: integer-voxel level and sub-voxel level. For integer-voxel displacement estimation, fast Fourier transform based cross-correlation (FFT-CC) algorithm [6, 10] is widely used as a classic method, which benefits from the fact that cross-correlation operation in space domain is equivalent to point-wise multiplication in frequency domain. The algorithm can be further accelerated by combining a fast sum-table approach [7]. To achieve a sub-voxel level accuracy, various sub-voxel registration algorithms have been developed, including the iterative algorithms derive from Newton’s minimization method, e.g., Levenberg-Marquardt algorithm [1], Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm [2, 11, 10], Newton–Raphson algorithm [7] and iterative least-squares algorithm [12], and non-iterative algorithms such as the correlation coefficients curve-fitting algorithm [6] and the gradient-based algorithm [7]. Recently, Pan et al. introduced three ideas to accelerate the computation of DVC effectively [13] : (1) an inverse compositional Gauss-Newton (ICGN) algorithm [14] was introduced to replace the conventional forward additive Newton–Raphson algorithm for sub-voxel registration. This algorithm eliminates the repeated updates of the Hessian matrix during the iterative procedure; (2) a reliability-guided displacement tracking strategy [15] was employed to provide fast and accurate initial guess for the ICGN algorithm; (3) a global look-up table of cubic interpolation coefficients [16] was pre-established to eliminate the redundant calculation of sub-voxel intensity interpolation coefficients for the construction of warped target image during the iterative procedure. By these means, the computation speed of DVC can surge up from about 0.9 points of interest per second (POI/s) to 41.1 POI/s when using a 19 × 19 ×19  -voxel subvolume [13].

The computation speed of DVC could be substantially increased further by processing POIs simultaneously on parallel computing device. However, the application of parallel computing technology requires the independence of calculation at each POI, which conflicts with any path-dependent processing strategy leading to a sequential computation scheme. To overcome this difficulty, our preliminary research proposed a path-independent DIC method [17], which estimated the initial guess for the ICGN algorithm at each POI independently using the FFT-CC algorithm. Elimination of path-dependency allows the application of parallel computing technology to the proposed DIC method. It was found that the parallel computing-powered DIC (paDIC) method, implemented on NVIDIA compute unified device architecture (CUDA) for graphics processing unit (GPU) devices, can reach a speedup of nearly two orders of magnitude without sacrificing high accuracy and precision [18].

Recently, effort has been made to accelerate DVC using GPU devices. Bar-Kochba et al. [19] proposed a DVC method that iteratively compares the reference volume image and the target volume image through FFT-CC algorithm and then adjusts the two images until the displacement increment obtained in iteration reaches a sufficiently small value. In their implementation, the FFT-CC computation at POIs can be performed in parallel on GPU or sequentially on CPU. The GPU implementation gained a speed improvement of about 5.5 times over its CPU counterpart. Gates et al. [10] combined FFT-CC (for coarse search) and BFGS algorithm (for sub-voxel registration), and provided an implementation of their DVC method in a hybrid multi-core CPU and multiple GPU mode. The GPUs were used to estimate the normalized cross-correlation objective function and its gradient at POIs in a refined parallel manner, and the multi-core CPU carries out the BFGS iteration accordingly. In their experiments, the program based on the collaboration of 3 GPUs and a 12-core CPU can be eight times faster than the one running merely on CPU. In addition, GPU was also employed to speed up the computation for a voxel-based global DVC method [20], by solving the resulting system including millions of degrees of freedom.

In this paper, we extend the paDIC to the parallel DVC (paDVC) on GPU devices and assess its performance using computer simulated 3D speckle images. The implementation of the proposed paDVC includes coarse-grained and fine-grained parallelism for all the three major parts, namely FFT-CC algorithm, ICGN algorithm and the precomputation procedure which can further accelerate the ICGN algorithm. The rest of the paper is organized as follows. Section 2 briefly explains the principle of the proposed path-independent DVC algorithm. Subsequently, section 3 describes the implementation of the paDVC on NVIDIA CUDA. In section 4, the accuracy, precision and computational efficiency of the paDVC are verified. Section 5 concludes the paper.

Principle of the paDVC

DVC tracks the displacement from a POI P(x 0y 0z 0) in the reference volume image (Fig. 1(a)) to a point P′(x 0 y 0 z 0 ) in the target volume image (Fig. 1(b)). In order to avoid mis-registration, a cubic subvolume centred at P(x 0y 0z 0) is selected as the basic matching unit.

Fig. 1
figure 1

Schematic illustration of the principle of DVC: the displacement vector d = (u, v, w)T from a POI P(x 0y 0z 0) in the reference volume image to P′(x 0 y 0 z 0 ) in the target volume image is obtained by tracking the corresponding cubic subvolumes. Q(x i , y i , z i ) and Q′(x i , y i , z i ) indicate the points within the reference subvolume and the target subvolume, respectively

The proposed paDVC can be summarized as a three-step procedure, as illustrated in Fig. 2. The integer-voxel registration is carried out using the 3D FFT-CC algorithm at every POI. The obtained integer-voxel deformation vector p0 = (u, 0, 0, 0, v, 0, 0, 0, w, 0, 0, 0)T is fed as the initial guess into the 3D ICGN algorithm for the sub-voxel registration. Prior to these two steps, the inverse of the Hessian matrix for each reference subvolume is precomputed and a global look-up table of tri-cubic interpolation coefficients for the target volume image is constructed. This precomputation step helps to eliminate a large number of redundant calculations during iterations of the 3D ICGN algorithm.

Fig. 2
figure 2

Flow chart of the proposed paDVC, where R and T refer to the reference subvolume and target subvolume, respectively, ξ is the local coordinates of points within the subvolume

Integer-Voxel Registration by the 3D FFT-CC Algorithm

Defining a POI-centric subvolume with N = (2M + 1) × (2M + 1) × (2M + 1) voxels, where N and M are integers, the displacement vector d = (u, v, w)T can be obtained by locating the peak of the zero-mean normalized cross-correlation (ZNCC) coefficients:

$$ {C}_{ZNCC}\left(u,v,w\right)=\frac{{\displaystyle \sum_i\left({R}_i-{R}_m\right)\left({T}_i-{T}_m\right)}}{\sqrt{{\displaystyle \sum_i{\left({R}_i-{R}_m\right)}^2{\displaystyle \sum_i{\left({T}_i-{T}_m\right)}^2}}}} $$
(1)

where R i and T i are the grey intensity values of the ith voxel in the reference subvolume and the target subvolume, respectively; \( {R}_m=\frac{1}{N}{\displaystyle \sum_{i=0}^{N-1}{R}_i} \) and \( {T}_m=\frac{1}{N}{\displaystyle \sum_{i=0}^{N-1}{T}_i} \) refer to the mean intensity values within the two subvolumes. According to the Fourier theory, the calculation of C ZNCC (u, v, w) in space domain can be performed as a simple point-wise multiplication in frequency domain, i.e.,

$$ {C}_{ZNCC}\left(u,v,w\right)=FF{T}^{-1}\left\{FF{T}^{*}\left[{\displaystyle \sum_i\frac{R_i-{R}_m}{\sqrt{{\displaystyle \sum_i{\left({R}_i-{R}_m\right)}^2}}}}\right]\cdot FFT\left[{\displaystyle \sum_i\frac{T_i-{T}_m}{\sqrt{{\displaystyle \sum_i{\left({T}_i-{T}_m\right)}^2}}}}\right]\right\} $$
(2)

where FFT and FFT − 1 are the fast Fourier transform and inverse fast Fourier transform, respectively, and the superscript “*” denotes the complex conjugate. The integer-voxel displacement vector d = (u, v, w)T is determined from the peak of C ZNCC (u, v, w) and then transformed to an integer-voxel deformation vector p0 = (u, 0, 0, 0, v, 0, 0, 0, w, 0, 0, 0)T.

Sub-Voxel Registration by the 3D ICGN Algorithm

Assuming that a point Q(x i , y i , z i ) in the reference subvolume R deforms to a point Q′(x i , y i , z i ) in the target subvolume T (see Fig. 1), a first-order shape function is selected to describe the relationship between Q and Q′ as

$$ \begin{array}{l}{x}_i^{\prime }={x}_i+u+{u}_x\varDelta x+{u}_y\varDelta y+{u}_z\varDelta z\\ {}{y}_i^{\prime }={y}_i+v+{v}_x\varDelta x+{v}_y\varDelta y+{v}_z\varDelta z\\ {}{z}_i^{\prime }={z}_i+w+{w}_x\varDelta x+{w}_y\varDelta y+{w}_z\varDelta z\end{array} $$
(3)

where u x , u y , u z , v x , v y, v z , w x , w y and w z are gradients of u, v and w with respect to x, y and z axis; Δx = x i  − x 0, Δy = y i  − y 0 and Δz = z i  − z 0 are the local coordinates of point Q with respect to the position of POI P in the reference subvolume. Equation (3) can be rewritten as

$$ {\mathbf{Q}}^{\mathbf{\prime}}=\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \mathbf{p}\right) $$
(4)

where W(ξ; p) is the warp function,

$$ \mathbf{W}\left(\boldsymbol{\upxi}; \mathbf{p}\right)=\left[\begin{array}{cccc}\hfill 1+{u}_x\hfill & \hfill {u}_y\hfill & \hfill {u}_z\hfill & \hfill u\hfill \\ {}\hfill {v}_x\hfill & \hfill 1+{v}_y\hfill & \hfill {v}_z\hfill & \hfill v\hfill \\ {}\hfill {w}_x\hfill & \hfill {w}_y\hfill & \hfill 1+{w}_z\hfill & \hfill w\hfill \\ {}\hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 1\hfill \end{array}\right]\left[\begin{array}{c}\hfill \varDelta x\hfill \\ {}\hfill \varDelta y\hfill \\ {}\hfill \varDelta z\hfill \\ {}\hfill 1\hfill \end{array}\right] $$
(5)

where ξ = (Δx, Δy, Δz, 1)T represents the local coordinates of point Q within the reference subvolume, and p = (u, u x , u y , u z , v, v x , v y , v z , w, w x , w y , w z )T represents the vector of deformation parameters. Besides, the incremental warp function W(ξ; Δ p) used to adjust the shape of the reference subvolume can be written as

$$ \mathbf{W}\left(\boldsymbol{\upxi}; \varDelta \mathbf{p}\right)=\left[\begin{array}{cccc}\hfill 1+\varDelta {u}_x\hfill & \hfill \varDelta {u}_y\hfill & \hfill \varDelta {u}_z\hfill & \hfill \varDelta u\hfill \\ {}\hfill \varDelta {v}_x\hfill & \hfill 1+\varDelta {v}_y\hfill & \hfill \varDelta {v}_z\hfill & \hfill \varDelta v\hfill \\ {}\hfill \varDelta {w}_x\hfill & \hfill \varDelta {w}_y\hfill & \hfill 1+\varDelta {w}_z\hfill & \hfill \varDelta w\hfill \\ {}\hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 1\hfill \end{array}\right]\left[\begin{array}{c}\hfill \varDelta x\hfill \\ {}\hfill \varDelta y\hfill \\ {}\hfill \varDelta z\hfill \\ {}\hfill 1\hfill \end{array}\right] $$
(6)

where Δ p = (Δu, Δu x , Δu y , Δu z , Δv, Δv x , Δv y , Δv z , Δw, Δw x , Δw y , Δw z )T is the incremental vector of deformation parameter. The incremental vector Δ p can be determined by the 3D ICGN algorithm which minimizes the zero-mean normalized sum of squared difference (ZNSSD) criterion,

$$ {C}_{ZNSSD}\left(\varDelta \mathbf{p}\right)={{\displaystyle \sum_{\xi}\left\{\frac{R\left[\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \boldsymbol{\Delta} \mathbf{p}\right)\right]-{R}_m}{\sqrt{{\displaystyle \sum_{\xi }{\left\{R\left[\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \boldsymbol{\Delta} \mathbf{p}\right)\right]-{R}_m\right\}}^2}}}-\frac{T\left[\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \mathbf{p}\right)\right]-{T}_m}{\sqrt{{\displaystyle \sum_{\xi }{\left\{T\left[\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \mathbf{p}\right)\right]-{T}_m\right\}}^2}}}\right\}}}^2 $$
(7)

Then, Δ p can be solved through

$$ \varDelta \mathbf{p}={\mathbf{H}}_{\mathbf{12x12}}^{-\mathbf{1}}{\displaystyle \sum_{\xi}\left\{{\left(\nabla R\frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right)}_{12x1}^T\left[\frac{\sqrt{{\displaystyle \sum_{\xi }{\left\{R\left[\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \boldsymbol{\Delta} \mathbf{p}\right)\right]-{R}_m\right\}}^2}}}{\sqrt{{\displaystyle \sum_{\xi }{\left\{T\left[\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \mathbf{p}\right)\right]-{T}_m\right\}}^2}}}\left(T\left[\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \mathbf{p}\right)\right]-{T}_m\right)-\left(R\left[\mathbf{P}+\mathbf{W}\left(\boldsymbol{\upxi}; \boldsymbol{\Delta} \mathbf{p}\right)\right]-{R}_m\right)\right]\right\}} $$
(8)

where ∇R is the gradient within the reference subvolume,

$$ \nabla R=\left(\frac{\partial R\left(x,y,z\right)}{\partial x},\frac{\partial R\left(x,y,z\right)}{\partial y},\frac{\partial R\left(x,y,z\right)}{\partial z}\right) $$
(9)

\( \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \) is the Jacobian matrix of the warp function,

$$ \frac{\partial \mathbf{W}}{\partial \mathbf{p}}=\left[\begin{array}{cccccccccccc}\hfill 1\hfill & \hfill \varDelta x\hfill & \hfill \varDelta y\hfill & \hfill \varDelta z\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill \\ {}\hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 1\hfill & \hfill \varDelta x\hfill & \hfill \varDelta y\hfill & \hfill \varDelta z\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill \\ {}\hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 1\hfill & \hfill \varDelta x\hfill & \hfill \varDelta y\hfill & \hfill \varDelta z\hfill \end{array}\right] $$
(10)

and H − 112 × 12 denotes the inverse of the 12 × 12 Hessian matrix H,

$$ \mathbf{H}={\displaystyle \sum_{\xi}\left\{\left[{\left(\nabla R\frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right)}_{12\times 1}^T{\left(\nabla R\frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right)}_{1\times 12}\right]\right\}} $$
(11)

The obtained Δ p is then substituted back into equation (6) to yield the incremental warp function W( ξ; Δ p) which is used to update the warp function W( ξ; p) according to

$$ \mathbf{W}\left(\boldsymbol{\upxi}; \mathbf{p}\right)\leftarrow \mathbf{W}\left[{\mathbf{W}}^{-1}\left(\boldsymbol{\upxi}; \boldsymbol{\Delta} \mathbf{p}\right),\mathbf{p}\right]=\mathbf{W}\left(\boldsymbol{\upxi}; \mathbf{p}\right){\mathbf{W}}^{-1}\left(\boldsymbol{\upxi}; \boldsymbol{\Delta} \mathbf{p}\right) $$
(12)

where

$$ {\mathbf{W}}^{-1}\left(\boldsymbol{\upxi}; \varDelta \mathbf{p}\right)={\left[\begin{array}{cccc}\hfill 1+\varDelta {u}_x\hfill & \hfill \varDelta {u}_y\hfill & \hfill \varDelta {u}_z\hfill & \hfill \varDelta u\hfill \\ {}\hfill \varDelta {v}_x\hfill & \hfill 1+\varDelta {v}_y\hfill & \hfill \varDelta {v}_z\hfill & \hfill \varDelta v\hfill \\ {}\hfill \varDelta {w}_x\hfill & \hfill \varDelta {w}_y\hfill & \hfill 1+\varDelta {w}_z\hfill & \hfill \varDelta w\hfill \\ {}\hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 1\hfill \end{array}\right]}^{-1}\left[\begin{array}{c}\hfill \varDelta x\hfill \\ {}\hfill \varDelta y\hfill \\ {}\hfill \varDelta z\hfill \\ {}\hfill 1\hfill \end{array}\right] $$
(13)

represents the inverse of the incremental warping function. This procedure is repeated until one of the following two pre-imposed convergence conditions is satisfied, i.e., \( \sqrt{{\left(\varDelta u\right)}^2+{\left(\varDelta v\right)}^2+{\left(\varDelta w\right)}^2}<0.001 \) or the maximum iteration number reaches 20.

It is clear that the Hessian matrix H only depends on the gradient ∇R in the reference subvolume and the Jacobian matrix of the warp function \( \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \), thus H and its inverse can be precomputed, which makes the ICGN algorithm faster than the traditional forward additive Newton–Raphson algorithm.

During the iterations, the reconstructed T[P + W(ξ; p)] in equation (8) often locates at a sub-voxel location (see point C in Fig. 3). Therefore, an interpolation scheme is applied to estimate the intensity at sub-voxel locations. In DIC bicubic spline interpolation is usually preferable to polynomial interpolation because it can give more accurate results [21]. However, Tri-cubic interpolation is chosen in the proposed DVC for two reasons: i) The amount of computation in DVC can be several orders of magnitude higher than that of DIC, which makes the computation speed a critical requirement. In this context, the significant speedup gained by using more computationally efficient tri-cubic interpolation, which in particular allows a precomputed global look-up table of interpolation coefficients that can further reduce the interpolation times by orders of magnitude [16], could be highly beneficial to the proposed DVC method at a price of small (sometimes imperceptible) loss of accuracy; ii) Based on our preliminary of the path-independent DIC [17], the proposed DIC method, using bicubic interpolation, gave an excellent accuracy and precision, which is comparable with the Newton–Raphson algorithm using bicubic spline interpolation on similar speckle images reported in Ref. [22]. Moreover, it was found in real experiments that the difference in the bias error of the Newton–Raphson algorithm using quantic spline interpolation and bicubic interpolation can be reduced to a negligible level if a proper Gaussian pre-filtering is performed on the noise contaminated speckle images [23]. Therefore, a tri-cubic interpolation scheme with global look-up table is considered as a good trade-off between computational efficiency and registration accuracy for the paDVC.

Fig. 3
figure 3

Diagram of the tri-cubic interpolation at a sub-voxel position surrounded by 8 voxels with integer coordinates

Implementation of the paDVC on CUDA

Mapping from CUDA Programming Model to NVIDIA GPU Hardware

Figure 4 illustrates the mapping from the CUDA programming model to NVIDIA GPU hardware. In the CUDA programming model, a parallel problem is first split into coarse sub-problems that can be solved independently by blocks, and each sub-problem is further divided into finer pieces that can be solved cooperatively by all the threads within the block. In the architecture of NVIDIA GPU hardware, the blocks are mapped as components on streaming multiprocessors (SMs).

Fig. 4
figure 4

Mapping from the CUDA programming model to NVIDIA GPU hardware

Parallelization of Precomputation

The gradients ∇R and ∇T used in building the inverse of the Hessian matrix in the reference subvolume and the global look-up table of tri-cubic interpolation coefficients for the target volume image are calculated according to the central differences approximation,

$$ \begin{array}{l}\frac{\partial R\left(x,y,z\right)}{\partial x}=\frac{R\left(x+1,y,z\right)-R\left(x-1,y,z\right)}{2},\kern0.75em \frac{\partial T\left(x,y,z\right)}{\partial x}=\frac{T\left(x+1,y,z\right)-T\left(x-1,y,z\right)}{2}\\ {}\frac{\partial R\left(x,y,z\right)}{\partial y}=\frac{R\left(x,y+1,z\right)-R\left(x,y-1,z\right)}{2},\kern0.75em \frac{\partial T\left(x,y,z\right)}{\partial y}=\frac{T\left(x,y+1,z\right)-T\left(x,y-1,z\right)}{2}\\ {}\frac{\partial R\left(x,y,z\right)}{\partial z}=\frac{R\left(x,y,z+1\right)-R\left(x,y,z-1\right)}{2},\kern0.75em \frac{\partial T\left(x,y,z\right)}{\partial z}=\frac{T\left(x,y,z+1\right)-T\left(x,y,z-1\right)}{2}\end{array} $$
(14)

Equation (14) indicates that the computation of the gradient at a voxel involves the intensity values at 3 × 3 × 3 voxels. In this work, instead of processing the gradient at each voxel individually, we introduce a tiled algorithm [24] to calculate the gradients in groups of 6 × 6 × 6 voxels using blocks containing 8 × 8 × 8 threads. By this way, an optimal occupancy of SMs can be achieved and the latency caused by intensive accesses to global memory can be minimized. During the precomputation of gradients for every 6 × 6 × 6 voxels, the intensity values at 8 × 8 × 8 voxels are loaded into the shared memory allocated to a block, as shown in Fig. 5. Then the calculation of gradients at 216 voxels is performed by 216 threads in parallel according to equation (14). It is noteworthy that 296 (=512–216) threads around the 216-thread tile are only used to store the intensity values at the voxels along the outer boundary and do not perform any calculation. In this trade-space-for-time strategy, all the data required to calculate the gradients at each voxel in the 6 × 6 × 6-voxel unit along the three axes are loaded only once to minimize the access of global memory, which is generally quite time consuming. As the precomputation takes a very small portion of the overall computation time (~3 %, according to Tables 2 and 3 in Section 4.3), this waste of threads only leads to trivial loss of computational efficiency.

Fig. 5
figure 5

Illustration of the tiled computation for volume image gradients

The tri-cubic interpolation used in equation (8) is defined as

$$ T\left(x,y,z\right)={\displaystyle \sum_{i=0}^3{\displaystyle \sum_{j=0}^3{\displaystyle \sum_{k=0}^3{\alpha}_{ijk}{x}^i{y}^j{z}^k}}} $$
(15)

where the 64 coefficients α ijk are obtained according to the intensity values at its eight surrounding integer voxels as well as their gradients (see Fig. 3). A look-up table of interpolation coefficients α ijk can be computed using the similar “tiled” strategy, in which each 8 × 8 × 8-thread block is assigned to calculate the interpolation coefficients of a 7 × 7 × 7 eight-voxel-unit. The constructed look-up table is then stored in global memory and will be accessed during the 3D ICGN algorithm.

Parallelization of 3D FFT-CC Algorithm

The 3D FFT-CC algorithm contains two steps: (a) calculating C ZNCC (u, v, w) using FFT; and (b) searching the peak of C ZNCC (u, v, w).

The calculation of cross correation is accelerated by CUDA FFT (CUFFT) library [25]. To achieve high utilization efficiency of GPU hardware, subvolumes are grouped in batches before they are transferred to the processing line. Although it is not guarenteed that all of the subvolumes are handled simultaneously due to hardware limitation, CUFFT automatically ensures a maximum number of subvolumes being processed in parallel.

The peak searching within each subvolume is also accelerated by parallel computing technology. First, the 3D voxel matrix containing p 3 voxels is transformed into a 1D array. The 3D index (u, v, w) is coded as a 1D index i through

$$ i=\left(w\times p+v\right)\times p+u $$
(16)

which can be decoded according to

$$ \begin{array}{ccc}\hfill u=i\ \mod\ p,\hfill & \hfill v=\left(i/p\right) \mod\ p,\hfill & \hfill w=\left(i/p\right)/p\hfill \end{array} $$
(17)

Second, a block containing m threads (m = 256 in this work) is assigned to handle the subvolume. As in most case p 3 is greater than the thread number in a block, the p 3 C zncc values and their 1D indices are divided into m groups. Each group is processed by a thread within the block to find the maxium C zncc value in the group through a simple maximum element searching algorithm. Third, a classic parallel reduction algorithm [26] is applied to locate the peak C zncc among the m C zncc values obtained by the threads, as illustrated in Fig. 6. In each reduction step, the first half of the threads compare their C zncc values with those in the second half of the threads. If the C zncc of any thread in the first half is smaller than that of its counterpart in the second half, the two threads exchange their data, viz. C zncc values and 1D indices. In the next reduction step, the comparison and data exchange only occur in the active half of the threads that containing larger C zncc values. This procedure continues until there remains only one active thread. Finally, the 3D index of the peak C zncc in a subvolume can be located by decoding the remained 1D index according to equation (17).

Fig. 6
figure 6

Flow chart of the parallel reduction algorithm within an individual block

Compared with the sequential searching algorithm, which results in a computation complexity of O(p 3) for a (p × p × p)-voxel subvolume, the parallel reduction algorithm lowers the computation complexity down to O(p 3/m).

Parallelization of the 3D ICGN Algorithm

Figure 7 illustrates the parallelization of the 3D ICGN algorithm. The computation at each POI is carried out by a block containing m threads (m = 256 in this work). Initially, the integer-voxel deformation vector estimated using the 3D FFT-CC algorithm is loaded into the shared memory of the block by one thread. Then, all the N threads within the block participate in the construction of the warped target subvolume T[P + W(ξ; p)], referring to the look-up table of tri-cubic interpolation coefficients. Afterwards, 12 threads are involved in updating the warp function W(ξ; p) according to equation (12), while the other threads become inactive. Finally, one thread remains active to check whether any of the two convergence conditions is satisfied and output the sub-voxel deformation in case the iteration is completed.

Fig. 7
figure 7

Schematic flow chart of the computation procedure within one thread block of the 3D ICGN algorithm in the paDVC

Batch Processing Mechanism

It would be ideal that all the POIs can be processed simultaneously on GPU. However, the degree of parallelism is restricted in practice due to the limited hardware capability, namely limited computing cores and memory on a GPU device. In particular, for a middle size volume image containing 512 × 512 × 512 voxels, the data generated during the DVC method can be tens of gigabytes, which is even far more than the primary memory equipped in a desktop computer. The paDVC, therefore, is implemented in a batch processing manner. It means that the volume images are first divided into cubes. And then the cubes are processed in turn, namely the POIs distributed in the whole volume image are processed in batches of a certain number. The batch size (or cube size) is variable, depending on the spatial sampling rate (viz. number of POIs within the cube) and the size of subvolume configured in the experiment. A basic rule in the batch planning is to achieve a full use of the available global memory on graphics card, and to ensure all the data required during the computation for a batch of POIs can be loaded into the global memory only once. In this work, the batch size is roughly estimated and set in the program. A dynamic and precise batch planning will be developed in the future. It is noteworthy that within the cube the calculation at POIs is also carried out in batches of an integer N, where N is dependent on the available blocks for the parallel computation in a GPU (see Fig. 4).

Obviously, the efficiency of the paDVC may be influenced to some extent by this batch processing mechanism due to more exchange of data between the global memory on graphics card and the primary memory on mainboard, and slightly redundant computation to treat the voxels along the boundary of cubes. However, this mechanism offers the paDVC a good adaptability to large scale of data and flexible computing power of GPU devices. An immediate boost of performance can be achieved by increasing the POI number in each batch when using a high-end graphics card equipped with more computing cores and memory.

Experimental Verification

Experimental Specification

The proposed paDVC method is programmed on CUDA 6.5 using C++ language, and tested on a desktop computer equipped with an Intel® Xeon® CPU E5-1650 (6 cores, 3.20 GHz main frequency), 16.0 GB RAM and a NVIDIA GeForce GTX 680 graphics card (8 SMs with 1536 CUDA cores and 2GB 256-bit RAM). It is worth mentioning that, although a low-end GPU is used in this work, the paDVC is applicable to any platforms supporting CUDA, and higher performance in terms of computation speed can be expected on more powerful platforms.

In order to quantitatively evaluate the accuracy and precision of the paDVC method, an 8-bit grayscale reference volume image R(x, y, z) (see Fig. 8) is generated according to Ref. [13]:

$$ R\left(x,y,z\right)=\mathrm{round}\left\{{\displaystyle \sum_{i=1}^s{I}_i \exp}\left[-\frac{{\left(x-{x}_i\right)}^2+{\left(y-{y}_i\right)}^2+{\left(z-{z}_i\right)}^2}{r^2}\right]\right\} $$
(18)

where s represents the total number of speckles, I i and (x i , y i , z i ) represents the random peak intensity value of the ith speckle and its center position, respectively. r is the radius of the speckle. Function round(x) returns the nearest integer to x. Afterwards, ten target volume images are generated by translating R(x, y, z) in the Fourier domain according to the shift theorem [21], with pre-set sub-voxel displacement along z-axis ranging from 0 to 1 voxel. The step between every two successive images is set to be 0.1 voxels. To simulate the influence of noise in real experiments, a white Gaussian noise field with a zero mean value and a variance of 4 gray levels is then superposed on these volume images. In this work, 11 volume images with a dimension of 512 × 512 × 512 voxels were generated, each of which contains approximately 1.5 × 106 speckles. In each volume image 7.29 × 105 (=90 × 90 × 90) POIs are evenly distributed with a space of 5 voxels between the neighbouring POIs. Three kinds of subvolumes with different sizes are selected to carry out the DVC calculation, namely 17 × 17 × 17 voxels, 25 × 25 × 25 voxels and 33 × 33 × 33 voxels.

Fig. 8
figure 8

Computer simulated reference volume image R(x, y, z) and the 10 translated target volume images from T 1 (x, y, z) to T 10 (x, y, z)

Verification of the Accuracy and Precision

The accuracy and precision of the paDVC are studied according to two performance indicators: mean bias error and standard deviation. The mean bias error of the w-component in z-direction is expressed by

$$ {e}_w=\frac{1}{N}{\displaystyle \sum_{i=1}^N\left({w}_i-{w}_{set}\right)} $$
(19)

where N is the number of POIs, w i is the calculated displacement at the ith POI and w set denotes the pre-set sub-voxel displacement. The standard deviation of the measured displacement is defined as

$$ {\sigma}_w=\sqrt{\frac{1}{N-1}{\displaystyle \sum_{i=1}^N\Big({w}_i}-\overline{w}\Big){}^2} $$
(20)

where \( \overline{w}=\frac{1}{N}{\displaystyle \sum_{i=1}^N{w}_i} \) denotes the expectation of the measured w-component.

Figure 9 shows the mean bias errors and standard deviations of sub-voxel displacement calculated by the paDVC using the three kinds of subvolumes, respectively. It can be seen in Fig. 9(a) that the mean bias error of the calculated w-component falls in a narrow band from −0.43 × 10−3 to 0.29 × 10−3 voxels. The accuracy of the paDVC seems not sensitive to the variation of subvolume size in a certain range. In Fig. 9(b), the standard deviations of the measured w-component are below 1.42 × 10−3 voxels. A larger subvolume size leads to a smaller standard deviation for the measurement of sub-voxel displacements, but at a price of significantly increased computation time, as will be discussed in the next subsection. Figure 10 gives the mean bias errors and standard deviations calculated on the same series of volume images without noises. Comparing the results in the two figures, it can be found that the white Gaussian noises reduce the accuracy and precision of the proposed paDVC method. But this adverse effect remains at an insignificant level.

Fig. 9
figure 9

(a) Mean bias errors and (b) standard deviation of the measured w-components on a series of noise-contaminated volume images, calculated using different subvolume sizes

Fig. 10
figure 10

(a) Mean bias errors and (b) standard deviation of the measured w-components on a series of noise-free volume images, calculated using different subvolume sizes

Verification of the Computational Efficiency

To assess the computational efficiency of the proposed paDVC, the same DVC algorithm with sequential implementation (seDVC) and multithreaded implementation (muDVC) based on OpenMP [27] running merely on CPU are used as benchmarks.

In the seDVC, the POIs are processed one by one. The FFTW library [28] is employed to perform the fast Fourier transform, and the peak C zncc as well as its index are obtained by a simple maximum element searching algorithm. The implementation of muDVC is almost identical to the seDVC except that a coarse-grained parallelization is applied. The number of threads is set to be equal to twice the number of physical CPU cores, since the Intel hyper-threading technology allows each CPU core operates at most two simultaneous threads. It is found that a larger thread number does not increase the computation speed further, because these threads actually share the time slices of up to 6 CPU cores. The batch processing scheme is also adopted in the seDVC and the muDVC due to the huge amount of data. To make the comparison fair, the three DVC programs use the same batch size during the experimental study. The Intel single instruction multiple data (SIMD) instruction is enabled for the compiling of all the programs.

Table 1 compares the computation time and computation speed among the paDVC, the muDVC and the seDVC with different subvolume sizes. The paDVC achieves a significant speedup, approximately 3.0× ~ 3.7× over the muDVC and 18.3× ~ 23.3× over the seDVC. Compared with another sequential implementation of ICGN algorithm-based DVC with path-dependent strategy, which was run on a desk computer equipped with similar grade CPU [13], the paDVC also demonstrates its clear advantage in computational efficiency. It can be observed that the increase in subvolume size effectively reduces the iteration number required by the 3D ICGN algorithm. However, a larger subvolume does not lead to a higher computation speed. The computation time taken by all the three DVC programs is substantially increased due to the surge of voxel number involved in calculation.

Table 1 Comparison of the average computation time and speed among the paDVC, the muDVC and the seDVC

Table 2 further explore the average time per batch consumed by the the three DVC programs in the precomputation stage. The paDVC is approximately 9.8× and 54.9× faster than the muDVC and the seDVC, respectively.

Table 2 Comparison of the average computation time consumed per batch in the precomputation step among the paDVC, the muDVC and the seDVC. Batch size means the number of POIs processed in one batch (or in a cube)

Table 3 gives comparisons of average computation time per POI in the stage 2 (3D FFT-CC) and stage 3 (3D ICGN) among the three programs. The calculation of the 3D FFT-CC powered by GPU leads to a speedup ranging from 4.6× to 22.0× over the two CPU-based implementations. And the GPU accelerates the operation of 3D ICGN algorithm in the paDVC for about 3.2 times to 19.8 times, as compared with the muDVC and seDVC. According to the results, it can be found that the most remarkable acceleration of the paDVC is achieved in the precomputation step, which is performed in a highly parallel manner. In the paDVC, each CUDA block can calculate the gradients at 216 voxels and the interpolation coefficients for 147 eight-voxel-units simultaneously, whereas the degree of parallel computing in the muDVC is 12 at most. In the other two steps, the speedup ratios achieved by the paDVC decrease markedly due to the limited CUDA cores in the GPU used in this work, which may allow only a couple of POIs to be processed in parallel. However, the fine-grained parallel computing applied in the calculation at each POI makes the paDVC demonstrate considerably superior computational efficiency over the muDVC.

Table 3 Comparison of computation efficiency for the 3D FFT-CC algorithm and the 3D ICGN algorithm among the paDVC, the muDVC and the seDVC

Conclusion

In this work, a parallel digital volume correlation (paDVC) method is proposed and implemented on GPU devices. The path-dependence of the conventional reliable initial guess transferring schemes widely used in iterative DVC algorithms is eliminated in the proposed paDVC method by introducing the 3D FFT-CC algorithm to estimate the integer-voxel initial guess for the 3D ICGN algorithm at each POI independently. Then a GPU-based parallel implementation of the proposed DVC method has been developed on CUDA. The proposed paDVC employs a variety of techniques and strategies to combine parallel computing and batch processing at coarse-grained and fine-grained level, and demonstrates a significantly enhanced computational efficiency more than 20 times higher than the CPU-based sequential implementation of the same DVC method, without sacrificing accuracy and precision. In comparison with the multithreaded implementation on multi-core CPU, the paDVC also gains an approximately 3× improvement in speed. Therefore, one may reach a conclusion that GPU-based parallel computing could be a superior option to accelerate DVC over the parallel computing based on current multi-core CPU, unless an explosive increase of CPU cores is realized. Moreover, since a high-end CPU with more than 8 cores is still quite expensive, the paDVC provides the end-users the possibility to carry out large scale DVC computation efficiently on their desktop computers equipped with low or medium level graphics cards for measurement or research work.