1 Introduction

Fringe analysis is a widely used technique in optical metrology to recover physical quantities such as displacement, strain, surface profile, and refractive index from interferograms. Interferograms are two-dimensional recordings, made by a digital camera, of interference patterns. The shape of the wavefront is encoded in the interference fringes or bands of the interferogram [13]. Fringe analysis is then the extraction of the quantitative measurement data from either a single fringe pattern or a collection of them [25]. It consists of one or two processes: phase demodulation and phase unwrapping [30].

The mathematical model of a fringe pattern is given by

$$\begin{aligned} I(x, y) = a(x, y) + b(x, y) \cos (\psi (x, y) + \phi (x, y)), \end{aligned}$$
(1)

where \(I(x, y)\) is the image intensity, \(a = a(x, y)\) is the background illumination, \(b = b(x, y)\) is the amplitude modulation, \(\phi = \phi (x, y)\) is the phase, and \(\psi = \psi (x, y)\) is the spatial carrier frequency. The main task of fringe analysis algorithms is to recover the phase term \(\phi\) and, in some cases, the background illumination and amplitude modulation as well. This is an inverse problem, because only the fringe pattern \(I(x, y)\) and sometimes the carrier frequency \(\psi\) are known.

In recent years, many techniques have appeared to solve the problem mentioned above. They depend on obtaining one or more fringe patterns from the experiments. For instance, phase-shifting techniques recover the phase map \(\phi\) by acquiring a collection of fringe patterns shifted from one another by a certain amount [32]. The phase difference between two consecutive fringe patterns is usually a constant term [1, 8, 15, 29, 30].

Among the variety of techniques for demodulating the phase map using a single fringe pattern, Takeda’s method was one of the first to be proposed. This method is based on the Fourier transform and considers the phase map as a continuous and smooth function [33]. More recently, solutions based on regularized Bayesian estimation costs have appeared. Regularization uses a priori information to impose restrictions on the estimated solution. These techniques have a high success rate at recovering the phase term \(\phi\) from highly noisy patterns; however, their numerical solution is usually quite expensive [18, 21, 28]. An extensive review of these methods is available in [30] and references therein.

All demodulation methods described above fail to recover discontinuous or piecewise-continuous phase maps. To the best of our knowledge, very few methods reported in the literature are capable of recovering these kinds of phase maps [2, 3, 9, 17, 31, 36]. One of them, based on a regularized cost function, uses a second-order edge-preserving potential [9]; another is based on wavelets using total variation (TV) regularization [36]. The rest are based on variational formulations using either TV or mean curvature regularization [20, 26]. Except for reference [31], these methods estimate the phase term, the background illumination, and the amplitude modulation from a single fringe pattern. We remark that all methods described in this paragraph report serial realizations of their models. Consequently, their running times are much longer than that of the parallel implementation of our model presented here.

For a model to succeed in the transition from applied research to industrial applications, a reliable and fast numerical realization of the model needs to be available. Despite the accurate recovery of the phase term, the computational times reported for the demodulation techniques discussed above show the need to provide them with fast computational solvers. In this work, we address this issue for the TV model by introducing a very fast parallel realization based on the fixed-point algorithm introduced in [17], whose convergence was proved in [3].

Other methods for solving the model in [17] are a very slow gradient descent algorithm and a recently published augmented Lagrangian algorithm [16]. Both algorithms have serial realizations and are hence slower than the parallel one introduced here.

The outline of this paper is as follows. In Sect. 2, we briefly review the TV model. In Sect. 3, we review the fixed-point algorithm. In Sect. 4, we present the parallel realization for multi-core CPU and GPU architectures. Experimental results on both synthetic and real experimental data are presented in Sect. 5, and our conclusions are given in Sect. 6.

2 Total variation based model

The TV model presented in [17] amounts to solving the following problem:

$$\begin{aligned} \mathop {\arg \min }\limits _{a,b,\phi } \; TV(a,b,\phi ,\psi ,g), \end{aligned}$$
(2)

where

$$\begin{aligned} TV \equiv \int _\Omega (I-g)^2 \,\mathrm {d}\Omega + \frac{1}{\lambda _a}\int _\Omega |\nabla a|\,\mathrm {d}\Omega + \frac{1}{\lambda _b}\int _\Omega |\nabla b|\,\mathrm {d}\Omega + \frac{1}{\lambda _\phi }\int _\Omega |\nabla \phi |\,\mathrm {d}\Omega , \end{aligned}$$
(3)

where \(\Omega \subseteq {\mathbb {R}}^2\), \(g=g(x,y)\) is the acquired fringe pattern, and \(\lambda _a,\lambda _b, \lambda _\phi\) are positive regularization parameters.

Due to the total variation regularization, this model can recover sharp phase transitions, as well as the background illumination and the amplitude modulation.

The solution of (2) is obtained by numerically solving the following set of second-order nonlinear Euler–Lagrange equations, one for each variable:

$$\begin{aligned} -\nabla \cdot \frac{\nabla a}{|\nabla a|} + \lambda _a(a+bc_\phi -g)=0, \end{aligned}$$
(4)
$$\begin{aligned} -\nabla \cdot \frac{\nabla b}{|\nabla b|} + \lambda _b(a+bc_\phi -g)c_\phi =0, \end{aligned}$$
(5)
$$\begin{aligned} -\nabla \cdot \frac{\nabla \phi }{|\nabla \phi |} + \lambda _\phi (a+bc_\phi -g)(-bs_\phi )=0 \end{aligned}$$
(6)

with \(c_\phi =\cos (\psi +\phi )\), \(s_\phi =\sin (\psi +\phi )\), and boundary conditions

$$\begin{aligned} \frac{\partial a}{\partial \nu }=0, \;\; \frac{\partial b}{\partial \nu }=0, \;\; \frac{\partial \phi }{\partial \nu }=0, \end{aligned}$$
(7)

where \(\nu\) denotes the unit outer normal vector to the boundary.

3 The fixed-point algorithm

In this work, we focus on developing a parallel realization of a fixed-point algorithm framework for solving each partial differential equation (PDE) presented in (4)–(6). This algorithm, introduced in [3], is convergent for any initial guess, and its structure is suitable for parallelization. We proceed to review the fixed-point algorithm.

To solve each PDE, an algorithm of the form (for general u)

$$\begin{aligned} L(u^k) u^{k+1}=f^k \end{aligned}$$
(8)

is constructed, where \(L=L(u)\) is a linearized operator given by

$$\begin{aligned} L = \left\{ \begin{array}{ll} -\nabla \cdot \frac{\nabla }{|\nabla a^k|} + \lambda _a I &{} \text {in}\; (4), \\ -\nabla \cdot \frac{\nabla }{|\nabla b^k|} + \lambda _b c_{\phi ^k}^2 &{} \text {in}\; (5), \\ -\nabla \cdot \frac{\nabla }{|\nabla \phi ^k|} + \lambda _\phi (b^k)^2 s_{\phi ^k}^2 &{} \text {in}\; (6), \\ \end{array} \right. \end{aligned}$$

and

$$\begin{aligned} f^{k} = \left\{ \begin{array}{ll} \lambda _a(-b^kc_{\phi ^k}+g) &{} \text {in}\; (4), \\ \lambda _b(-a^k+g)c_{\phi ^k} &{} \text {in}\; (5), \\ -\lambda _\phi s_{\phi ^k}\left( -a^kb^k+gb^k-(b^k)^2(c_{\phi ^k}-s_{\phi ^k}\phi ^k)\right) &{} \text {in}\; (6). \\ \end{array} \right. \end{aligned}$$

Therefore, given an arbitrary initial guess, each fixed-point algorithm constructs a convergent sequence of solutions of the type \(\{u^k\}_{k\ge 1}\). Note that L is a linear operator obtained in (4) and (5) by lagging the nonlinear diffusion coefficients at every k-th iteration, and in (6) by additionally using a first-order Taylor expansion of the cosine function. The linear system (8) does not need to be solved very accurately, and a few iterations of any sparse linear solver suffice. The operator L has some nice properties: it is symmetric, positive definite, and diagonally dominant for a and b, whereas it is only positive semi-definite and weakly diagonally dominant for \(\phi\).
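For concreteness, substituting the first rows of L and \(f^k\) into (8) gives the linear PDE solved at every iteration for the background illumination a:

$$\begin{aligned} -\nabla \cdot \frac{\nabla a^{k+1}}{|\nabla a^k|} + \lambda _a a^{k+1} = \lambda _a \left( g - b^k c_{\phi ^k}\right) . \end{aligned}$$

Since the diffusion coefficient \(1/|\nabla a^k|\) is frozen at the previous iterate, this equation is linear in \(a^{k+1}\); the updates for b and \(\phi\) follow the same pattern with their respective rows of L and \(f^k\).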

4 Parallel implementation

We develop two parallel implementations in C/C++ of the serial phase demodulation algorithm (see Algorithm 1): the first uses a multiple-core system (here referred to as multi-core CPU), and the second uses a GPU-based architecture. We use OpenMP for multi-thread programming on the multi-core CPU system, while on the GPU architecture, we use CUDA programming. We also use the open computer vision library OpenCV [12] and the high-performance vector mathematics library Blitz++ [34] in both implementations: OpenCV only for reading and writing images, and Blitz++ for array management on the multi-core CPU.

We now proceed to describe the discretization scheme for the PDEs. For this purpose, let \(\Omega = [0,n]\times [0,m]\) be a continuous domain and let \((h_x,h_y)\) represent a vector of finite mesh sizes. Then, the discrete domain \(\Omega _h\) can be defined as \(\Omega _h = \Omega \cap G_h\), where \(G_h=\{(x,y):x=x_i =ih_x, y=y_j=jh_y; i,j\in {\mathbb {Z}}\}\) is an infinite grid. Take u as an \(n \times m\) array where each entry \(u_{i,j}\), for \(i=1,...,n\) and \(j=1,...,m\), is the discrete value of the continuous variable on the grid \(\Omega _h\) at the point \((x_i,y_j)\). In what follows, we use u to represent any of the variables a, b, or \(\phi\).

Algorithm 1 iteratively evaluates the TV functional in (3), the boundary conditions (BC), and the Gauss–Seidel (GS) method, until the normalized error Q between previous and current solutions is less than a given threshold value \(\epsilon\) or the maximum number of iterations \(\text {MaxIter}\) is reached. The procedure in Algorithm 2 computes the value of (3) by approximating the gradient operator \(\nabla u_{i,j}\) as follows:

$$\begin{aligned} \nabla u_{i,j} = \left( \frac{u_{i+1,j}-u_{i,j}}{h_x},\frac{u_{i,j+1}-u_{i,j}}{h_y}\right) . \end{aligned}$$
(9)
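A minimal serial sketch of this Algorithm 2-style evaluation, assuming row-major \(n \times m\) arrays, a uniform mesh size h, and a single regularization parameter \(\lambda = \lambda _a = \lambda _b = \lambda _\phi\) as in our experiments, could look as follows (all names are illustrative, not those of our actual implementation):

```cpp
// Sketch of the evaluation of the TV functional (3) using the forward
// differences of (9). Boundary row/column are omitted for brevity.
#include <cmath>
#include <vector>

double evalTV(const std::vector<double>& a, const std::vector<double>& b,
              const std::vector<double>& phi, const std::vector<double>& psi,
              const std::vector<double>& g, int n, int m,
              double h, double lambda) {
  auto at = [m](const std::vector<double>& u, int i, int j) {
    return u[i * m + j];                       // row-major indexing
  };
  auto gradNorm = [&](const std::vector<double>& u, int i, int j) {
    double ux = (at(u, i + 1, j) - at(u, i, j)) / h;  // forward differences (9)
    double uy = (at(u, i, j + 1) - at(u, i, j)) / h;
    return std::sqrt(ux * ux + uy * uy);
  };
  double sum = 0.0;
  for (int i = 0; i < n - 1; ++i)
    for (int j = 0; j < m - 1; ++j) {
      // fidelity term (I - g)^2, with I from the fringe model (1)
      double I = at(a, i, j) + at(b, i, j) * std::cos(at(psi, i, j) + at(phi, i, j));
      double r = I - at(g, i, j);
      sum += r * r
           + (gradNorm(a, i, j) + gradNorm(b, i, j) + gradNorm(phi, i, j)) / lambda;
    }
  return sum * h * h;                          // discrete area element
}
```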

The BC procedure computes the Neumann boundary conditions with the following equations:

$$\begin{aligned} u_{i,1}&= u_{i,2}, \quad u_{i,m}=u_{i,m-1}, \nonumber \\ u_{1,j}&= u_{2,j}, \quad u_{n,j}=u_{n-1,j}, \end{aligned}$$
(10)

for \(i=1,...,n\) and \(j=1,...,m\).
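A direct transcription of this procedure, for a row-major \(n \times m\) array u, is the following sketch:

```cpp
// BC procedure: boundary entries copy their first interior neighbor,
// enforcing the discrete Neumann conditions (10).
void applyBC(double* u, int n, int m) {
  for (int i = 0; i < n; ++i) {          // left and right columns
    u[i * m + 0]     = u[i * m + 1];
    u[i * m + m - 1] = u[i * m + m - 2];
  }
  for (int j = 0; j < m; ++j) {          // top and bottom rows
    u[0 * m + j]       = u[1 * m + j];
    u[(n - 1) * m + j] = u[(n - 2) * m + j];
  }
}
```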

Let \(\mathbf {v}_{i,j}=[u_{i+1,j},u_{i-1,j},u_{i,j+1},u_{i,j-1}]\) be a column vector containing the four neighbors of \(u_{i,j}\) and let

$$\begin{aligned} \mathbf {w}_{i,j}&=[w_1,w_2,w_3,w_4]\nonumber \\&=\left[ \frac{1}{|\nabla u_{i+1,j}|_\beta },\frac{1}{|\nabla u_{i-1,j}|_\beta }, \frac{1}{|\nabla u_{i,j+1}|_\beta },\frac{1}{|\nabla u_{i,j-1}|_\beta }\right] \end{aligned}$$

be the corresponding vector of regularized nonlinear terms approximated by

$$\begin{aligned} |\nabla u_{i+1,j}|_\beta =|\nabla u_{i,j+1}|_\beta = \sqrt{(u_x^1)^2+(u_y^1)^2+\beta }, \end{aligned}$$
(11)
$$\begin{aligned} |\nabla u_{i-1,j}|_\beta = \sqrt{(u_x^2)^2+(u_y^2)^2+\beta }, \end{aligned}$$
(12)
$$\begin{aligned} |\nabla u_{i,j-1}|_\beta = \sqrt{(u_x^3)^2+(u_y^3)^2+\beta }, \end{aligned}$$
(13)

where \(\beta > 0\) is a small parameter to avoid division by zero, and

$$\begin{aligned} u_x^1&=(u_{i+1,j}-u_{i,j})/h_x,&u_y^1&=(u_{i,j+1}-u_{i,j})/h_y, \\ u_x^2&=(u_{i,j}-u_{i-1,j})/h_x,&u_y^2&=(u_{i-1,j+1}-u_{i-1,j})/h_y, \\ u_x^3&=(u_{i+1,j-1}-u_{i,j-1})/h_x,&u_y^3&=(u_{i,j}-u_{i,j-1})/h_y \end{aligned}$$

are the derivatives approximated by finite differences. In our simulations and without loss of generality, we considered the spatial step sizes to be equal in both directions, that is, \(h=h_x=h_y\). For the regularization parameters, it was enough to select them all equal, i.e., \(\lambda =\lambda _a=\lambda _b=\lambda _\phi\).

The GS procedure shown in Algorithm 3 computes an approximate solution of a, b, and \(\phi\) using the GS method with red-black ordering. The update of each variable is as follows:

$$\begin{aligned} u_{i,j}^{q+1} = \frac{\mathbf {v}^T_{i,j}\mathbf {w}_{i,j} + f_{i,j}}{D_{i,j} + 2w_1 + w_2 + w_4}, \end{aligned}$$
(14)

where all the terms on the right-hand side are evaluated at the \(q{\text{th}}\) iteration, and

$$\begin{aligned} D_{i,j} = \left\{ \begin{array}{ll} \lambda &{} \text {in}\; (4), \\ \lambda c^2_{\phi _{i,j}} &{} \text {in}\; (5), \\ \lambda b_{i,j}^2s^2_{\phi _{i,j}} &{} \text {in}\; (6). \\ \end{array} \right. \end{aligned}$$
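As an illustration, the following hedged sketch performs one red-black sweep of update (14) for the a-equation, where \(D=\lambda\); the b- and \(\phi\)-updates only change D and the right-hand side. All names are illustrative:

```cpp
#include <cmath>

// Regularized forward-difference gradient magnitude sqrt(ux^2 + uy^2 + beta)
// at pixel (i,j), as in (11)-(13); h is the uniform mesh size.
static double gradBeta(const double* u, int i, int j, int m,
                       double h, double beta) {
  double ux = (u[(i + 1) * m + j] - u[i * m + j]) / h;
  double uy = (u[i * m + j + 1] - u[i * m + j]) / h;
  return std::sqrt(ux * ux + uy * uy + beta);
}

// One red-black Gauss-Seidel sweep of update (14) for the a-equation
// (D = lambda). color = 0 updates red pixels (i+j even), color = 1 black
// pixels. f holds the right-hand side f^k of (8).
void gsSweep(double* u, const double* f, int n, int m,
             double h, double lambda, double beta, int color) {
  for (int i = 1; i < n - 1; ++i)
    for (int j = 1 + ((i + 1 + color) & 1); j < m - 1; j += 2) {
      double w1 = 1.0 / gradBeta(u, i,     j,     m, h, beta); // also w3, see (11)
      double w2 = 1.0 / gradBeta(u, i - 1, j,     m, h, beta);
      double w4 = 1.0 / gradBeta(u, i,     j - 1, m, h, beta);
      double num = w1 * (u[(i + 1) * m + j] + u[i * m + j + 1])  // w3 = w1
                 + w2 *  u[(i - 1) * m + j] + w4 * u[i * m + j - 1]
                 + f[i * m + j];
      u[i * m + j] = num / (lambda + 2.0 * w1 + w2 + w4);        // update (14)
    }
}
```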

For computing the normalized error Q, we use

$$\begin{aligned} Q(z_1,z_2)=\Vert z_1-z_2\Vert / (\Vert z_1\Vert +\Vert z_2\Vert ), \end{aligned}$$
(15)

as reported in [19, 24]. This equation defines a relative error that does not depend on physical dimensions. The Q values lie in the interval [0, 1], with Q approaching zero as \(z_1\) and \(z_2\) get closer to each other.
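A minimal sketch of (15) follows; the norm is not specified in the text, so the Euclidean norm is assumed here:

```cpp
#include <cmath>

// Normalized error (15) between two arrays of length len, assuming the
// Euclidean norm. The result is dimensionless and lies in [0, 1].
double Q(const double* z1, const double* z2, int len) {
  double d = 0.0, n1 = 0.0, n2 = 0.0;
  for (int i = 0; i < len; ++i) {
    double diff = z1[i] - z2[i];
    d  += diff * diff;
    n1 += z1[i] * z1[i];
    n2 += z2[i] * z2[i];
  }
  return std::sqrt(d) / (std::sqrt(n1) + std::sqrt(n2));
}
```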

For the TV and Q evaluations, we need a parallel reduction procedure [4, 5] to compute the corresponding sum. For the BC procedure, only assignment statements are necessary, and they are independent of each other. In both implementations, serial and parallel, we use Gauss–Seidel with red-black ordering; thus, our serial and parallel results are the same. The following subsections describe the implementation of these procedures.

[Algorithm 1 (phase demodulation driver), Algorithm 2 (evaluation of the TV functional), and Algorithm 3 (Gauss–Seidel with red-black ordering) appear here as pseudocode figures.]
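Putting the pieces together, a hedged sketch of the Algorithm 1 driver could look as follows; buildRHS, which assembles the right-hand side \(f^k\) of (8) for each variable, and the stopping test on consecutive phase estimates are assumptions of this sketch:

```cpp
#include <vector>

// Prototypes of the sketches above.
void applyBC(double* u, int n, int m);
void gsSweep(double* u, const double* f, int n, int m,
             double h, double lambda, double beta, int color);
double Q(const double* z1, const double* z2, int len);

// Hypothetical helper: assembles f^k of (8) for variable v
// (0 = a, 1 = b, 2 = phi), following the three cases of f^k.
std::vector<double> buildRHS(int v, const std::vector<double>& a,
                             const std::vector<double>& b,
                             const std::vector<double>& phi,
                             const std::vector<double>& psi,
                             const std::vector<double>& g,
                             int n, int m, double lambda);

void demodulate(std::vector<double>& a, std::vector<double>& b,
                std::vector<double>& phi, const std::vector<double>& psi,
                const std::vector<double>& g, int n, int m, double h,
                double lambda, double beta, double eps,
                int maxIter, int zeta) {
  std::vector<double>* vars[3] = {&a, &b, &phi};
  std::vector<double> phiPrev = phi;
  for (int k = 0; k < maxIter; ++k) {
    for (int v = 0; v < 3; ++v) {
      std::vector<double> f = buildRHS(v, a, b, phi, psi, g, n, m, lambda);
      for (int q = 0; q < zeta; ++q) {  // few inner sweeps (zeta = 4 in Sect. 5)
        // gsSweep is the a-version above; for b and phi the diagonal
        // term D of (14) changes accordingly
        gsSweep(vars[v]->data(), f.data(), n, m, h, lambda, beta, 0); // red
        gsSweep(vars[v]->data(), f.data(), n, m, h, lambda, beta, 1); // black
      }
      applyBC(vars[v]->data(), n, m);
    }
    // the TV value (3) can be monitored here via evalTV(...)
    if (Q(phiPrev.data(), phi.data(), n * m) < eps) break;  // stopping test (15)
    phiPrev = phi;
  }
}
```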

4.1 OpenMP implementation

OpenMP is an API for shared-memory parallel programming on a multi-core CPU architecture [23]. Thus, OpenMP is suitable for systems in which each thread or process can access all available memory. OpenMP provides a set of directives, or pragmas, used to specify parallel regions. Among other things, these directives make it easy to manage threads inside parallel regions and to distribute for loops among them.

For the TV and Q evaluations, we use the directive \(\texttt {\#pragma}\) \(\texttt {omp}\) \(\texttt {parallel}\) \(\texttt {for}\) \(\texttt {reduction(+:sum)}\). For the \(\text {BC}\) procedure, we use the directive \(\texttt {\#pragma}\) \(\texttt {omp}\) \(\texttt {parallel}\) \(\texttt {for}\). In the Gauss–Seidel algorithm with red-black ordering [7], the pixels are labeled red or black following a chessboard pattern: we consider a pixel \(r=(i,j)\) red if \(i+j\) is even and black if \(i+j\) is odd (see Algorithm 3). Then, when the red pixels are updated in the for loop, they only need the black pixel values, and vice versa. This reordering yields an equivalent equation system with more independent computations [27], resulting in efficient parallel implementations of the GS method.

The directive \(\texttt {\#pragma}\) \(\texttt {omp}\) \(\texttt {parallel}\) opens a parallel region and, combined with the \(\texttt {for}\) directive, parallelizes the for loop. In shared-memory programs, the individual threads have private and shared memory, and communication is accomplished through shared variables. Inside the \(\texttt {\#pragma}\) \(\texttt {omp}\) \(\texttt {for}\), all variables are shared by default among all threads; hence, we only use the \(\texttt {private}\) clause to declare variables that are private to each thread.

Additionally, we use the clause \(\texttt {num}\) \(\texttt {threads(N}_\texttt {T)}\) to specify the number of threads \(\texttt {N}_\texttt {T}\) to be launched by our program.
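As an illustration of how these directives combine, the following hedged sketch shows the reduction used for the TV and Q sums and a parallelized red-black sweep; updatePixel stands for update (14) at one pixel and, like all names here, is an assumption of this sketch rather than our actual code:

```cpp
#include <omp.h>

// Assumed helper: applies the Gauss-Seidel update (14) at pixel (i,j).
void updatePixel(double* u, const double* f, int i, int j, int m,
                 double h, double lambda, double beta);

// Parallel reduction for the TV and Q sums: each thread accumulates a
// private partial sum, and OpenMP combines them at the end of the loop.
double reduceSum(const double* v, int len, int NT) {
  double sum = 0.0;
  #pragma omp parallel for reduction(+:sum) num_threads(NT)
  for (int i = 0; i < len; ++i)
    sum += v[i];
  return sum;
}

// Parallel red-black sweep: rows are distributed among NT threads; within
// one color, all pixel updates are independent, so no locking is needed.
void redBlackSweepOMP(double* u, const double* f, int n, int m,
                      double h, double lambda, double beta,
                      int color, int NT) {
  int i, j;  // j must be declared private; i is the parallel loop index
  #pragma omp parallel for private(i, j) num_threads(NT)
  for (i = 1; i < n - 1; ++i)
    for (j = 1 + ((i + 1 + color) & 1); j < m - 1; j += 2)
      updatePixel(u, f, i, j, m, h, lambda, beta);
}
```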

4.2 CUDA implementation

CUDA (compute unified device architecture) is an extension of the C language that contains a set of instructions for parallel computing on a GPU. The host processor spawns multi-thread tasks (kernels) onto the GPU device, which has an internal scheduler that allocates the kernels to whatever GPU hardware is present [6]. The GPU is used for general-purpose computation; it devotes a large fraction of its transistors to arithmetic logic units and follows the single instruction, multiple threads (SIMT) programming model, which is exploited when multiple data are processed by a single parallel instruction, similar to the single instruction, multiple data (SIMD) model [5, 14].

Following the conventional GPU programming model, once \(a^0,b^0,\phi ^0,\psi ,g\) are allocated in the CPU memory, we reserve their corresponding memory space on the GPU device. Then, these variables are loaded onto the GPU device from the CPU through a memory copy. In our program, we define four main kernel functions for the TV, BC, GS, and Q procedures. For the TV and Q procedures, we implemented a parallel reduction with dynamic shared memory and the interleaved-pair strategy [5]. The array is divided into thread blocks; each thread block computes a partial sum using shared memory; then, these partial sums are copied back to the CPU memory and summed on the CPU. Note that this last sum is computed in parallel on the multi-core CPU.
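A minimal sketch of such a reduction kernel, assuming blockDim.x is a power of two (names are illustrative), is:

```cuda
// Block-wise reduction for the TV and Q sums, using dynamically allocated
// shared memory and the interleaved-pair strategy.
__global__ void reduceKernel(const double* in, double* blockSums, int len) {
  extern __shared__ double sdata[];            // dynamic shared memory
  unsigned int tid = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = (idx < len) ? in[idx] : 0.0;    // load one element per thread
  __syncthreads();
  // interleaved-pair reduction: halve the number of active threads per step
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
  }
  if (tid == 0) blockSums[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

On the host side, such a kernel would be launched as \(\texttt {reduceKernel<<<blocks, threads, threads * sizeof(double)>>>(d\_in, d\_blockSums, len)}\), followed by a memory copy of the block sums back to the host and an OpenMP reduction over the few partial sums.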

Another way to implement the reduction of the partial sums is to use the atomic function \(\texttt {atomicAdd}\). This function reads a value from some address in global or shared memory, adds a number to it, and writes the result back to the same address; no other thread can access that address until the operation is complete. In this way, we avoid the memory copy of the partial sums, using only the GPU for all processing. A consideration to take into account is that, for 64-bit floating-point data, \(\texttt {atomicAdd}\) is only supported by Nvidia GPUs of compute capability 6.0 and higher (see the atomic functions section in [22] for more details).
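Under the same assumptions as the previous sketch, the \(\texttt {atomicAdd}\) variant replaces the per-block write with an atomic accumulation into a single global accumulator:

```cuda
// Each block still reduces into shared memory, but thread 0 then adds its
// partial sum atomically to *total, so no partial sums are copied back to
// the CPU. *total must be zeroed before launch; double-precision atomicAdd
// requires compute capability >= 6.0, as noted above.
__global__ void reduceAtomicKernel(const double* in, double* total, int len) {
  extern __shared__ double sdata[];
  unsigned int tid = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = (idx < len) ? in[idx] : 0.0;
  __syncthreads();
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {  // interleaved pairs
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
  }
  if (tid == 0) atomicAdd(total, sdata[0]);
}
```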

5 Experimental results

The experiments were executed on a server with an Intel(R) Xeon(R) Gold 5222 CPU at 3.80 GHz, Ubuntu 18.04 (64-bit), 16 hyper-threading cores, 48 GB of RAM, and an Nvidia Quadro RTX 8000 video card with compute capability 7.5. We fixed the parameters \(\lambda =10\), \(MaxIter=5\times 10^5\), and \(\epsilon =10^{-7}\). For the GS procedure, only a few iterations \(\zeta\) are needed, as recommended in [3]; thus, we fixed \(\zeta =4\). We developed two versions of our serial and parallel implementations: one using 64-bit floating-point data (double precision) and the other using 32-bit floating-point data (float precision).

We note that similar models for demodulating discontinuous phase maps [2, 9] are only equipped with gradient descent algorithms with serial realizations. Therefore, their running times are very slow compared with those obtained with our parallel algorithm. For instance, [17] reported a processing time of 800 s to solve the problem of Fig. 2 for an image of size \(250 \times 250\) pixels using the model in [9], and a processing time of 144 s for the same problem using a gradient descent algorithm for the TV model. We ran our parallel algorithm on the same problem and obtained a processing time at least ten times shorter than the one reported in [17]. We remark that serial gradient descent algorithms do not scale well with the size of the images, while our parallel algorithm does. Furthermore, the computational realization of the model in [2] is even slower, since a fourth-order, highly nonlinear PDE has to be solved.

In our experiments, Fig. 1 shows the number of iterations, processing time in seconds, and normalized error Q between the desired phase \(\phi\) and the estimated phase \(\phi ^*\) of our parallel phase demodulation implementation with double precision. As shown there, we ran the GPU simulations for different values of \(\beta\) and image sizes. We can see that \(\beta =10^{-4}\) is a good compromise between processing time and precision; we use this value for the rest of our synthetic experiments.

Fig. 1 Number of iterations, processing time, and normalized error of our parallel phase demodulation method, for different values of \(\beta\) and different image sizes

Figure 2 shows a result of our phase demodulation method with double precision using synthetic data. The images have a size of \(240\times 320\) pixels. The first row shows the spatial carrier frequency \(\psi\), the fringe pattern g, and the desired phase to estimate \(\phi\), respectively. The second row shows the initial value of background illumination \(a^0\), amplitude modulation \(b^0\), and phase estimation \(\phi ^0\) at iteration \(k=0\). The third row shows the optimal estimations \(a^*\), \(b^*\), and \(\phi ^*\). We obtain a normalized error \(Q(\phi ,\phi ^*)=0.0279\).

Fig. 2 Phase demodulation using synthetic data

The speedup of a parallel program is defined as

$$\begin{aligned} S=\frac{T_s}{T_p} \, , \end{aligned}$$
(16)

where \(T_s\) is the processing time of the serial program and \(T_p\) is the processing time of the parallel program [23].

Figure 3 shows the speedup of our parallel implementation using the multi-core CPU with double precision, launching from \(N_T=2\) to \(N_T=32\) threads. The server has 16 hyper-threading cores, so each physical core is divided into two virtual or logical cores that share resources such as the instruction pointers, integer registers, floating-point registers, scheduling queues, caches, and execution units. The performance of a parallel implementation declines when a virtual core monopolizes critical resources such as the floating-point registers or the caches. Increasing the performance of a parallel implementation is fundamentally an optimization problem, which is very difficult due to the different memory hierarchies between platforms and the variation of core interconnects on a single processor [11]. We achieve the best speedup when launching \(N_T=16\) threads for the different image sizes, while for \(N_T>16\), the performance declines.

Fig. 3 Evaluation of speedups of the parallel phase demodulation algorithm, using the multi-core CPU from 2 to 32 threads for different image sizes

Figure 4 shows the speedup of our parallel implementation using the GPU with double precision. In our parallel GPU implementation, we split the reduction process into two sums: the first is executed in the GPU, and the second can be executed in parallel in the CPU. We launch from \(N_T=2\) to \(N_T=32\) threads in our multi-core CPU; note that the speedup is almost constant for \(N_T\le 16\) and declines slightly for \(N_T>16\). We can see that \(N_T=2\) threads are enough to obtain the best performance, demonstrating that the most demanding processing is computed in the GPU.

Fig. 4 Evaluation of speedups of the parallel phase demodulation algorithm, using a GPU and the multi-core CPU from 2 to 32 threads for different image sizes

Tables 1, 2, 3, and 4 show a comparative performance of our serial and parallel implementations with double and float precision, respectively, for different image sizes of our synthetic data. The processing time is measured from the start of Algorithm 1 until it terminates, which means that we consider both the processor work and the memory transfer time between the CPU and the GPU. We executed the parallel implementations 20 times in each experiment, reporting the mean and standard deviation of the number of iterations k, the processing time, and the normalized error Q. We can see that our implementations using the GPU have the shortest times.

Table 1 Number of iterations, processing time (seconds), and normalized error Q of our serial and multi-core CPU implementations using double precision
Table 2 Number of iterations, processing time (seconds), and normalized error Q of our GPU implementations using double precision

When floating-point arithmetic is used, rounding errors can lead to unexpected results: there are calculations with real numbers that produce quantities not exactly representable in float or double precision [10]. Rounding modes for basic operations, including subtraction, multiplication, and division, are specified in the IEEE-754 standard; the most frequently used is the round-to-nearest mode [35]. In the results of our implementations using double precision (see Tables 1 and 2), we obtain the same number of iterations and the same normalized error to 12 decimal places.

Table 3 Number of iterations, processing time (seconds), and normalized error Q of our serial and multi-core CPU implementations using float precision
Table 4 Number of iterations, processing time (seconds), and normalized error Q of our GPU implementations using float precision

The TV and Q procedures of our algorithm need to compute a summation over all elements of an array the size of the image to be processed; this operation may propagate rounding errors, as can be seen in Tables 3 and 4, where we use float precision. Note that the number of iterations differs between implementations. It is important to note that the serial-float implementation does not scale linearly with the size of the images. We expect the algorithm to achieve a better speedup for larger images than for smaller ones; Table 3 shows that this is not true for the image of size \(1920\times 2560\), whose speedup is inferior to the one obtained for the image of size \(960\times 1280\). If we ignore this result, the normalized errors are the same to four decimal places for all our implementations using float precision, and the same to three decimal places when taking into account the results with double precision. However, even when we consider the result of the serial implementation for the image of size \(1920\times 2560\), the normalized error is the same when rounded to two decimal places for all implementations with float or double precision, which is acceptable for the quality of the recovered phase map.

On the other hand, the results of our \(GPU - 2C\) implementation with float precision (see Table 4) show that the number of iterations and the normalized errors have zero standard deviation: they are the same in each execution for each image size, as is indeed expected of a deterministic algorithm.

To compare our parallel implementations, Figs. 5 and 6 show the speedup of each implementation for different image sizes using double and float precision, respectively. For the multi-core CPU, we launch from \(N_T=2\) to \(N_T=20\) threads; for \(GPU - 2C\), we launch only \(N_T=2\) threads in the CPU; and for \(GPU - atom\), we use the \(\texttt {atomicAdd}\) function to avoid some memory copies between the CPU and the GPU. Note that the GPU implementations have almost the same speedup level in each of our experiments, achieving the best performance for all image sizes. The advantage of \(GPU - atom\) is that all computations are performed in the GPU; thus, other programs can run in the CPU without compromising performance. For the multi-core CPU implementation, the speedup goes from \(S=2\) to \(S=7\), and for the GPU implementations, from \(S=25\) to \(S=58\) using double precision; using float precision, the multi-core CPU speedup goes from \(S=4\) to \(S=9\), and the GPU speedup from \(S=57\) to \(S=103\).

Fig. 5 Evaluation of speedups of the parallel phase demodulation algorithm, using the multi-core CPU (from 2 to 20C), the GPU with 2C, and the GPU with \(\texttt {atomicAdd}\), for different image sizes and double precision

Fig. 6 Evaluation of speedups of the parallel phase demodulation algorithm, using the multi-core CPU (from 2 to 20C), the GPU with 2C, and the GPU with \(\texttt {atomicAdd}\), for different image sizes and float precision

The first row in Fig. 7 shows the results of our parallel phase demodulation method with double and float precision when using real experimental data. The presented data consist of two images (\(\psi\) and g) of size \(480\times 640\) pixels. The second row shows the results with \(\lambda =10\) and \(\beta =10^{-4}\) (the same parameters as those used for the synthetic data experiments); the algorithm takes \(k=16269\) iterations, 880.84 s in the serial version and 24.68 s in the \(GPU-2C\) version with double precision, and \(k=13402\) iterations and 7.96 s in the \(GPU-2C\) version with float precision. The third row shows the results when using \(\lambda =2\) and \(\beta =10^{-7}\); here, the algorithm takes \(k=45173\) iterations, 2446.78 s in the serial version and 68.58 s in the \(GPU-2C\) version with double precision, and \(k=41455\) iterations and 23.25 s in the \(GPU-2C\) version with float precision. Qualitatively, we can see that the results using double and float precision are very similar, even though the algorithm converges after a different number of iterations.

Fig. 7 Phase demodulation using real experimental data

6 Conclusions

In this paper, we presented parallel implementations of the fixed-point algorithm for solving the total variation phase demodulation model. We presented results from synthetic and real experiments on a server with 16 hyper-threading cores and a GPU.

The multi-core CPU implementation achieves its best speedup (9x) using 16 threads. The two GPU versions of our implementation (\(GPU-2C\) and \(GPU-atom\)) obtain almost the same speedup, outperforming the multi-core CPU implementations. In general, the GPU implementations achieve a speedup of up to 103x over the serial implementation.

The results show that the parallel implementations converge and obtain the same number of iterations and normalized error to 12 decimal places when we use double precision. In contrast, due to the propagation of rounding errors, we obtain a different number of iterations when using float precision; however, the resulting normalized errors of our parallel implementations with float precision agree to three decimal places with those obtained in double precision. Qualitatively, the estimated phases are very similar.

As future work, we intend to explore the use of multigrid algorithms to speed up the algorithm even further.