1 Introduction

Cardiac disease remains the leading cause of death worldwide [1]. Ventricular fibrillation (VF), a life-threatening arrhythmia, is associated with disruption of the ventricular electrophysiological signalling that controls the contraction of the heart muscle. Such disruption manifests as spatiotemporally disorganized electrical waves [2,3,4,5] that require immediate intervention. Another form of arrhythmia, atrial fibrillation (AF), can last for years and impair patients’ quality of life. It is estimated that 2.7–6.1 million Americans suffer from AF [1]. If untreated for prolonged periods, AF can lead to more problematic arrhythmias or even stroke.

It is possible that VF and AF treatment could be improved by designing patient-specific prevention, control and/or therapy using new computational tools that are fast, accessible and easy to use [6]. In fact, numerical simulations of cardiac dynamics are becoming increasingly important in addressing patient-specific interventions [7] and evaluating drug effects [8]. It is noteworthy that the Food and Drug Administration recently sponsored a new Cardiac Safety Research Consortium initiative (CiPA) [8, 9] that specifies the use of mathematical models of cardiac cells to aid pro-arrhythmic drug risk assessment. However, as the mathematical models incorporate more detailed and sophisticated biophysical mechanisms, they are becoming extremely complex mathematically, with some of them requiring the solution of 50–100 nonlinear ordinary differential equations (ODEs) per computational cell [10]. Such ODEs typically are stiff and thus require a small temporal discretization, which is further complicated by the spatial discretization size imposed by the size of the cardiac cells. These complicating factors make cardiac dynamic simulations too large for traditional serial CPU-based computing. While some efforts have been made to create programs to aid with cardiac cell simulations in PCs [11, 12], in general, scientists have used supercomputer-based high-performance implementations of cardiac models to study cardiac electrophysiology, especially for large two- and three-dimensional tissues. However, supercomputers are expensive to acquire and hard to maintain, and even when such resources are managed by individuals other than the end users, users typically are required to submit their programs for execution as batch jobs, which can be inconvenient.

Substantial advances in the computational power of graphics processing units (GPUs) have made them an attractive alternative to traditional high-performance computing. Currently available GPUs are equipped with thousands of powerful computational cores, and they can be acquired at affordable prices, sometimes as low as a few hundred US dollars. As such, they can provide high-performance computing on personal computers at a fraction of the cost of traditional CPU-based supercomputers. However, GPUs require machine code that is prepared for the specific target GPU hardware. Thus, computer codes either need to be implemented in a special language intended for GPU programming or must be modified so that they become suitable for execution on GPUs. At present, there are several languages and programming solutions that enable implementation of GPU applications. As might be expected, each solution and programming language has certain benefits and may perform differently for different applications. Therefore, a comprehensive study comparing the ease of programming and the performance of such languages and solutions when applied to cardiac models can help researchers in the cardiac community choose the appropriate approach.

In this study, we investigate some of the major languages and solutions in cardiac GPU computing. Specifically, we consider (1) GPU computing solutions available in MATLAB, (2) the pragma-based approach of OpenACC, (3) Python-Numba, (4) TensorFlow, (5) WebGL 2.0 together with the Abubu.js library, and (6) NVIDIA CUDA. Our comparisons are based on implementations without any substantial program-specific optimization. Of course, we expect that applying language-specific optimizations could improve performance. However, it is fair to assume that most cardiac researchers are not necessarily experts in GPU programming, so that in many cases the solution that provides the best performance with minimal effort would be ideal. Nevertheless, our comparisons will help users with a broad range of programming expertise make informed choices about GPU implementations for cardiac models.

2 Methods

2.1 Models

We will compare performance using three different models with different complexity. The FitzHugh-Nagumo (FHN) model [13, 14] is a two-variable model used as a generic excitable media model and in some cases as a cardiac model. Tuning the model’s parameters can change features like the trajectory of the spiral wave tip [15].

The Minimal Model (MM) [16] is a four-variable model developed to reproduce many important properties of cardiac cells while also prioritizing computational tractability. The model includes a variable representing voltage as well as three gating variables that govern the dynamics of summary sodium and calcium currents; a time-independent potassium current also is included. Different parameterizations of the MM have been shown to reproduce the dynamics of other models with good fidelity [6, 16,17,18].

The Beeler-Reuter (BR) model [19] is an eight-variable model that includes sodium, calcium, and potassium currents. It was the first model developed to simulate ventricular tissue and the first to include an intracellular calcium concentration. We modified the BR model by reducing \(\tau _f\) and \(\tau _d\) to \(50\%\) of their original values, which speeds up the corresponding gates and prevents spiral wave breakup [20]. With the default parameter set, the original model gives rise to spiral wave breakup in two dimensions [20, 21]. This is also the first model for which it was shown that reaction-diffusion equations for cardiac cells can produce spiral waves in 2D [21].

2.2 Numerical Methods

The cardiomyocytes’ membrane potential (V) propagation through gap junctions (and in neurons through synapses) can be modeled by a cable equation [22], which is given by

$$\begin{aligned} \begin{aligned} \partial _t V(\varvec{x},t)=&\nabla \cdot (\tilde{D}\nabla V)-\frac{I_{total}}{C_m}.\\ \end{aligned} \end{aligned}$$
(1)

Here, the membrane potential diffuses with a diffusion coefficient \(\tilde{D}\) (which represents the fiber orientation of the heart [23, 24] and, in general, is anisotropic and heterogeneous), while the ionic concentrations are local in cardiac as well as neuronal tissues. The transmembrane currents for all ions as well as the ion pumps and exchangers are included in \(I_{\text {total}} = \sum I_i(V,y_i) \). The most general form of a transmembrane current \(I_i\) permeable to ion i is simply \(I_i=g_i(V - E_i)\), where \(g_i\) is a conductance term, V is the membrane potential or voltage, and \(E_i\) is the Nernst potential for ion species i. Often, the conductance is calculated using gates following the Hodgkin-Huxley [22] formalism, in which the conductance term is decomposed into the product of a maximal conductance term and one or more separate normalized variables that represent the probability of finding the channel open, which typically depends on the membrane potential or an ion concentration. These variables follow first-order differential equations of the form

$$\begin{aligned} \begin{aligned} \frac{dy_i(t)}{dt}=&\alpha _{y_{i}}(V)(1-y_i)-\beta _{y_i}(V)y_i \end{aligned} \end{aligned}$$
(2)

where \(\alpha _{y_i}\) is the rate at which the channel gate \(y_i\) transitions from closed to open and \(\beta _{y_i}\) is the rate at which it transitions from open to closed; both rates are functions of voltage. An alternative representation used in some models is achieved through Markov chains, where each state \(s_p\) follows a differential equation of the form

$$\begin{aligned} \begin{aligned} \frac{ds_p(t)}{dt}=&\sum _{q=1,q\ne p}^n (k_{qp}s_q - k_{pq}s_p), \end{aligned} \end{aligned}$$
(3)

where \(k_{qp}\) is the transition rate from state \(s_q\) to \(s_p\). With either formulation, the ordinary differential equations become partial differential equations once a spatially extended system, rather than a single cell, is considered. More details on how to numerically integrate these equations including convergence and boundary conditions can be found in Ref. [25].
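As a small illustration of Eq. (3), the following Python sketch evaluates its right-hand side for a hypothetical three-state channel. The helper name `markov_rhs`, the rate matrix, and the state occupancies are invented for this example and are not taken from any of the models used in this study. Because every outflow from one state is an inflow to another, the derivatives sum to zero, so the total occupancy is conserved.

```python
def markov_rhs(s, k):
    """Right-hand side of Eq. (3).

    s : list of state occupancies s_p
    k : k[q][p] is the transition rate from state q to state p
    """
    n = len(s)
    dsdt = []
    for p in range(n):
        total = 0.0
        for q in range(n):
            if q != p:
                # inflow from q into p minus outflow from p into q
                total += k[q][p] * s[q] - k[p][q] * s[p]
        dsdt.append(total)
    return dsdt


# Hypothetical three-state scheme (closed, open, inactivated); rates in 1/ms.
rates = [
    [0.0, 0.5, 0.0],   # closed -> open
    [0.1, 0.0, 0.3],   # open -> closed, open -> inactivated
    [0.05, 0.0, 0.0],  # inactivated -> closed
]
occupancy = [0.7, 0.2, 0.1]
derivs = markov_rhs(occupancy, rates)
print(derivs, sum(derivs))  # the sum is zero up to round-off
```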

In all cases, we used a domain size of \(20 \times 20 \) cm. The diffusion coefficient \(\tilde{D}\) was assumed to be isotropic and homogeneously defined over the domain. Finite differences were used for numerical simulations. To discretize the spatial term in Eq. (1), a second-order central difference scheme was used in both the x and y directions. Most variables were integrated in time using the forward Euler time-stepping scheme. As an exception, the time-integration of the Hodgkin-Huxley-type gates in Eq. (2) used the Rush-Larsen time-stepping scheme [26]. In all cases, a uniform Cartesian grid was employed. The grid sizes used were \(256 \times 256\), \(512 \times 512\), \(1024 \times 1024\), and \(2048 \times 2048\). This implies that for the smaller grid sizes, the solution was not fully numerically resolved. However, we emphasize that our objective was to compare the same solution obtained under different conditions, which guarantees that for the same model, we deal with the same loading conditions on the GPU cores. The time step was chosen as \(\varDelta t = 0.05 \,\text {ms}\) for all grid sizes except \(2048 \times 2048\), for which \(\varDelta t=0.01 \,\text {ms}\) was employed instead to satisfy the CFL condition. Our initial conditions were set to the resting state of the cells everywhere in the domain, except for nodes with \(x<1 \,\text {cm}\), which were excited to create a traveling wave toward the right-hand side of the domain. Later, at \(t=600\) for the FHN model and at \(t=370\,\text {ms}\) for the MM and BR models, a depolarizing stimulus was applied to the bottom half of the domain, where \(y<10 \,\text {cm}\), by raising the transmembrane potential to a depolarized value. This voltage was set to 1.0 for the FHN and MM models and to \(30 \,\text {mV}\) for the BR model (Fig. 1). For more information on the implementation details, see the computer codes that can be downloaded from http://abouzar.net/SmolkaFest2019/codes.zip.
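The time-stepping described above can be sketched in plain Python as follows. This is a minimal, unoptimized illustration rather than the code used for the measurements: the function names, the treatment of the boundaries by index clamping, and the value of the diffusion coefficient are assumptions made only for this example. The Rush-Larsen update is written in terms of \(y_\infty = \alpha /(\alpha +\beta )\) and \(\tau = 1/(\alpha +\beta )\), which follow directly from Eq. (2).

```python
import math


def laplacian(v, i, j, nx, ny, dx):
    """Second-order central-difference Laplacian at node (i, j).

    Neighbor indices are clamped at the edges, one simple (assumed)
    way to approximate no-flux boundaries.
    """
    im, ip = max(i - 1, 0), min(i + 1, nx - 1)
    jm, jp = max(j - 1, 0), min(j + 1, ny - 1)
    return (v[j][im] + v[j][ip] + v[jm][i] + v[jp][i] - 4.0 * v[j][i]) / (dx * dx)


def euler_voltage_step(v, i_total, diff, dx, dt, c_m=1.0):
    """One forward Euler step of Eq. (1) with isotropic, homogeneous diffusion."""
    ny, nx = len(v), len(v[0])
    return [
        [
            v[j][i] + dt * (diff * laplacian(v, i, j, nx, ny, dx) - i_total[j][i] / c_m)
            for i in range(nx)
        ]
        for j in range(ny)
    ]


def rush_larsen_gate_step(y, alpha, beta, dt):
    """Rush-Larsen update of one gate obeying Eq. (2), with V frozen over the step."""
    tau = 1.0 / (alpha + beta)
    y_inf = alpha * tau
    return y_inf + (y - y_inf) * math.exp(-dt / tau)


def max_stable_dt(diff, dx):
    """Forward Euler stability bound for 2D diffusion: dt <= dx^2 / (4 D)."""
    return dx * dx / (4.0 * diff)


# Example: stability bound for the two finest grids, with an assumed D.
D_ASSUMED = 0.001  # cm^2/ms, illustrative value not taken from the paper
for n in (1024, 2048):
    print(n, max_stable_dt(D_ASSUMED, 20.0 / n))
```

With this assumed diffusion coefficient and a grid spacing of 20 cm divided by the number of nodes per side, the bound is roughly 0.095 ms for the \(1024 \times 1024\) grid and roughly 0.024 ms for the \(2048 \times 2048\) grid, which is consistent with a time step of 0.05 ms being acceptable on the former but not on the latter.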

Fig. 1. Membrane potential for the FHN (first row), MM (second row), and BR (third row) models at the initial time (first column), application of the depolarizing voltage from the bottom half of the domain (second column), transient spiral wave dynamics (third column), and after the spiral wave stabilizes (fourth column).

Because the details of GPU programming are closely connected with the different implementations studied, this information is provided below in the next section.

3 Comparison of GPU Implementations

Below, we describe six different GPU implementations of the three models (FHN, MM, and BR). In some cases, we also compare additional options available for a particular configuration. Along with measurements of speedup as a function of the number of grid points, we also comment on ease of programming.

First, we implemented a serial version of all three models in the C programming language. The PGI-C compiler was used to generate the machine code. This serial version was used in all speedup calculations. The speedup was defined as follows:

$$\begin{aligned} \text {Speedup}=\frac{\text {wall-time of single-core serial CPU C-program}}{\text {wall-time of GPU implementation}}, \end{aligned}$$
(4)

where wall-time is the elapsed real time of program execution, as an ordinary wall clock would measure it; here, the computer’s clock was used for the measurements.
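Equation (4) can be evaluated with a simple timing harness such as the Python sketch below. The helper names, the best-of-several-repeats policy, and the placeholder workloads are assumptions made for this example and do not describe the measurement scripts used in this study; `time.perf_counter` is a standard-library wall-clock timer. When timing real GPU code, the device generally has to be synchronized before the clock is stopped, because kernel launches may return before the work has finished.

```python
import time


def wall_time(run, repeats=3):
    """Return the best wall-clock time (seconds) of `run()` over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(serial_seconds, accelerated_seconds):
    """Eq. (4): serial wall-time divided by accelerated wall-time."""
    return serial_seconds / accelerated_seconds


# Placeholder workloads standing in for the serial and accelerated solvers.
def slow_solver():
    sum(i * i for i in range(200_000))


def fast_solver():
    sum(i * i for i in range(2_000))


print(speedup(wall_time(slow_solver), wall_time(fast_solver)))
```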

All measurements were carried out on a Manjaro Linux operating system with kernel version 4.19.34. The system had an AMD\(^\circledR \) Ryzen Threadripper 2990WX 32-core processor that was used for CPU time measurements (although only one core was used in the CPU case). The graphics card used for GPU measurements was an NVIDIA TITAN V/PCIe/SSE2.

In this study, all measurements were carried out in double precision, except for the WebGL 2.0 and TensorFlow cases. For the CUDA and OpenACC implementations, we tried single-precision calculations and the speedups did not change more than \(10\%\) on this GPU.

3.1 MATLAB

MATLAB, originally meaning matrix laboratory, is a proprietary programming language developed by MathWorks. MATLAB allows for easy matrix operations and is equipped with several linear solvers, as well as built-in plotting and visualization features, that together make it very popular for general programming in academic settings. MATLAB’s easy-to-learn programming syntax makes it attractive to novice programmers, and its feature-rich environment makes it attractive to seasoned programmers, for both prototyping algorithms and research. Additionally, MATLAB has an interactive user interface that combined with MATLAB’s interpreter removes the hurdles of compiling, running, and visualization of the data. As such, MATLAB has been adopted as the companion language or the language of choice in several books [27,28,29,30,31,32,33,34,35]. MATLAB is also widely used in several research fields, including but not limited to, fluid mechanics [36, 37], geophysical studies [38,39,40], volcanology [41,42,43,44], astrophysics [45,46,47,48], chemical engineering [49,50,51,52], image analysis [53,54,55,56,57], neural networks [58,59,60], cell modeling [61,62,63,64], and cardiac studies [65,66,67,68,69,70,71]. MATLAB also provides GPU parallelism through fully automated GPU acceleration, the arrayfun command which applies a function to each element of arrays, and CUDA kernel calls.

Here, we implemented the arrayfun and CUDA kernel call options for each of the three models. The arrayfun function applies a MATLAB function to all elements of an array. After the arrays are sent to the GPU using the gpuArray() function, calling arrayfun at each time step allows the function to be run on the GPU: MATLAB’s interpreter recognizes that the function can be run independently for each element of the array and executes it on the GPU. Since this approach still relies on automatic detection of the parallelizable section and automatic acceleration of the code, it is expected to be less than “ideal”. The second approach in MATLAB is to write the GPU code manually as a CUDA kernel and run that kernel from MATLAB. This approach is expected to give the best performance of the MATLAB options, since no “guess-work” is required of the MATLAB interpreter and the GPU code is already parsed. The upside is that all MATLAB visualization and data analysis tools still can be used, and the CUDA kernel is only in charge of running the accelerated code in an optimum way. However, writing CUDA kernels requires familiarity with the NVIDIA CUDA C language in addition to familiarity with MATLAB. Hence, it is expected that fewer MATLAB users will be comfortable programming CUDA kernels. Speedup is assessed for problem sizes of \(2^{16}\), \(2^{18}\), \(2^{20}\), and \(2^{22}\) grid points. By default, all implementations in MATLAB use double-precision variables.

Fig. 2. Speedup of models vs. grid size for the FHN, MM, and BR models using MATLAB with array functions (dashed lines) and with CUDA kernels (solid lines).

Figure 2 shows that in all cases, as expected, the use of CUDA kernels provided more substantial speedup than the corresponding arrayfun implementations, by as much as a factor of six. The largest speedup was found for the BR model, which is not surprising given that it has the most equations and thus the most potential for concurrency within a given time iteration. Correspondingly, the FHN model attained the smallest speedups, but it still achieved a speedup of nearly two orders of magnitude for the largest grid size using CUDA kernels. The MM achieved speedups more than twice those of the FHN model, most likely because, in addition to having twice the number of ODEs, it has a significant number of additional algebraic equations evaluated during each time step, thus allowing greater potential for performance increase through greater parallelization over each time step.

3.2 OpenACC

OpenACC is a programming standard that was developed as a joint effort between Cray, CAPS, NVIDIA and PGI as an alternative to low-level CUDA programming. OpenACC, similar to OpenMP, uses a pragma-based approach to identify the computer code regions that can be parallelized on the GPU. The pragma directives, together with environment variables and library calls, facilitate accelerating regions of serial C/C++ or FORTRAN CPU codes that can benefit from parallelization, typically loops, and in this case support the use of GPGPU computing. As a result, OpenACC provides an approach that can accelerate mature CPU C/C++ or FORTRAN codes with minimal effort. This feature has made OpenACC an attractive choice to a large group of researchers in various fields including but not limited to fluid mechanics [72,73,74,75,76], earthquake modeling [77], deep neural networks [78, 79], astrophysics and data mining [80], cardiovascular [81] and cardiac electrophysiology [82, 83].

OpenACC pragmas were added to the serial C code described above to achieve parallelism. After initializing the solution, we used OpenACC’s “data in” pragmas to copy the data to the GPU. Then, in the parallel loops, these arrays were marked as present on the GPU to avoid unnecessary copying of the arrays to and from the GPU. The data was copied back to CPU memory only at the time steps at which we intended to write data to disk. This was achieved through the “update self” pragma. Data was written to disk only for debugging; during performance measurements, no data was written to disk. Both single-precision and double-precision implementations were tested and the variations in performance were limited to less than \(10\%\) on this particular GPU. The results presented here were generated using double-precision variables.

Fig. 3. Speedup of models vs. grid size for the FHN, MM, and BR models using OpenACC directives in the C programming language.

Figure 3 shows the resulting speedups, which are quite similar to the speedups obtained using MATLAB with CUDA kernels. Again, the BR model benefited the most from acceleration, with speedups of up to three orders of magnitude due to the fact that, for the BR model, the computational cost of the reaction operations is much more than for the diffusion term. This suggests that the more complicated models of cardiac dynamics can benefit even further from the use of OpenACC implementations.

3.3 Python Numba Implementation

Python is an interpreted general-purpose language created in the early 1990s by Guido van Rossum at Stichting Mathematisch Centrum in the Netherlands as a successor of ABC [84]. Python supports multiple programming paradigms, including functional, object-oriented, and procedural programming. Due to its feature-rich environment and its approachable learning curve, Python is widely popular as a language for teaching [85, 86] and research in various fields such as astrophysics [87,88,89], machine learning [90, 91], neural networks [92], and business [93]. Similarly, Python is also used in cardiovascular [94,95,96,97,98,99,100,101], cardiac electrophysiology [102,103,104,105,106,107], and arrhythmia detection [108] studies. Python’s popularity is evident in the large number of conferences dedicated to Python programming that are held each year, including DjangoCon Europe, EuroPython, EuroSciPy, Kiwi PyCon, O’Reilly Open Source Convention, Plone Conference, PyCon conferences held in different regions of the world, PyData, PyGotham, SiPy and many more [109]. Project Jupyter has also contributed significantly to the popularity of the Python language by providing a web application that allows users to create and share documents that contain live interactive Python code, equations, visualization and narrative text [110].

The widespread popularity of Python has resulted in a broad range of Python libraries that can be used for various different applications. One very popular library is NumPy, which provides support for definition of multi-dimensional array objects and array operations [111], which can be very useful in scientific computing. Numba is an open-source Just-In-Time (JIT) compiler that translates a subset of Python and NumPy code into accelerated machine code [112]. The Numba compiler can provide acceleration through multicore CPUs or GPGPU. Numba has a very simple approach to accelerating Python code. In fact, the Numba website provides a tutorial that teaches Python programmers to start accelerating their Python code in as little as five minutes [113]. This small learning curve makes Numba an attractive choice for cardiac modeling.

Fig. 4. Speedup of models vs. grid size for the FHN, MM, and BR models using Python Numba.

Figure 4 shows speedup results using Python-Numba. All measurements were carried out using double-precision variables. As expected, the BR model achieves the greatest speedup and the FHN model the least. Most notable is that large performance gains do not appear until grid sizes of \(2^{20}\), which may be due to the fact that for each time step there is a certain overhead time imposed for launching a parallel code on the GPU. However, that effect becomes less important when we keep the GPU busy for longer periods before advancing to the next time step.
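The role of a fixed per-step launch overhead can be illustrated with a toy cost model. The Python sketch below is not fitted to the measurements: the overhead and per-node costs are invented numbers chosen only to show the trend, and the function name is hypothetical. In this model the speedup for one time step is \(N c_{\text {cpu}} / (t_{\text {launch}} + N c_{\text {gpu}})\), which is small while the launch overhead dominates and approaches \(c_{\text {cpu}}/c_{\text {gpu}}\) as the number of nodes N grows.

```python
def modeled_speedup(n_nodes, cpu_cost_per_node, gpu_cost_per_node, launch_overhead):
    """Toy model of the speedup for a single time step."""
    cpu_time = n_nodes * cpu_cost_per_node
    gpu_time = launch_overhead + n_nodes * gpu_cost_per_node
    return cpu_time / gpu_time


# Hypothetical costs in microseconds (illustrative only, not measured).
C_CPU, C_GPU, OVERHEAD = 0.05, 0.0001, 200.0

for exponent in (16, 18, 20, 22):
    n = 2 ** exponent
    value = modeled_speedup(n, C_CPU, C_GPU, OVERHEAD)
    print(f"2^{exponent} nodes: modeled speedup {value:.1f}")
```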

3.4 TensorFlow Implementation

TensorFlow is a free and open-source software library primarily designed by the Google Brain team [114, 115] for internal use. It was released for public use later under Apache 2.0 open-source licence on November 9, 2015 [116]. It has been used by a number of companies, including Airbnb, AIRBUS, Coca-Cola, Google, Intel, PayPal and Qualcomm [117] in both research and production. TensorFlow can be used on single as well as multiple CPUs and GPUs. Once the target device is chosen, the parallelization is carried out automatically by the TensorFlow engine. TensorFlow provides extensive features to be used for machine learning and deep neural network applications [114, 115, 118]. Hence, it can be considered an attractive choice for model-based machine-learning environments where machine-learning algorithms can be trained using a dynamical numerical model. A number of groups in the cardiac community have embraced TensorFlow [119,120,121,122,123,124,125,126,127].

Fig. 5. Speedup of models vs. grid size for the FHN, MM, and BR models using TensorFlow.

Speedup results for the three models using TensorFlow are given in Fig. 5. The measurements for TensorFlow were made using single-precision variables, as some of the functions did not have a double-precision implementation for GPU parallelism at the time of coding the TensorFlow programs. In this case, speedup is quite limited compared to the other approaches considered, with the maximum speedup (attained for the BR model on the largest grid) still well below 100. The speedups are the result of just choosing the target device to be the GPU. No optimization such as using convolutions was used here. The effort required for parallelism on the GPU was minimal compared to the other languages. Given the minimal effort required for achieving parallelism, programmers are encouraged to use the GPU as their target device for all models, especially more complex ones.

3.5 WebGL and Abubu.js Implementation

WebGL, or the Web Graphics Library, is a royalty-free JavaScript application programming interface standard that provides low-level 2D and 3D rendering capabilities in modern web browsers through the HTML5 canvas element, without the need to install any plug-ins [128]. This means that WebGL applications can run on any modern web browser and on any major operating system (such as Microsoft Windows, macOS, Linux, Android, or even iOS), and at the same time harness the computational power of the available GPU on that device. WebGL applications are automatically compiled at run-time for the particular user’s graphics card. Therefore, WebGL applications do not need to be compiled by the developers for all the intended GPUs and operating systems. This also means that WebGL applications are capable of harnessing the computational power of GPU devices from various vendors, unlike some languages such as NVIDIA CUDA, which can only run on specific hardware.

The core of a WebGL application is written in the OpenGL Shading Language (GLSL), which is a high-level programming language with a syntax based on the C programming language [129]. GLSL supports most of the familiar C/C++ structural components, such as if statements, for loops, etc. It also has a number of built-in functions for mathematical, vector, and matrix operations as well as texture access [130]. One drawback of using WebGL is that it currently supports only single-precision variables and textures. Therefore, for applications that must use double-precision floats, WebGL is not suitable at present. When using WebGL for parallelism, usually texture memory is utilized as the basic data structure for the input and output of the programs [131]. In addition, WebGL can have a steep learning curve for novice programmers or those who are not well versed in graphics programming. The Abubu.js programming library is used to address this issue and remove the hurdles of GPGPU programming with WebGL [6]. Using Abubu.js, WebGL has been shown to be capable of solving a wide range of problems from studying fractals, solitons and chaos [132] to cardiac dynamics, fluid mechanics, and crystal growth [6].

Fig. 6. Speedup of models vs. grid size for the FHN, MM, and BR models using WebGL together with the Abubu.js library.

In this work, we followed the methods proposed in [6] to implement the FHN, the MM, and the BR model in WebGL using Abubu.js. Figure 6 shows performance gains using our WebGL implementation, which generally outperforms all other implementations. In particular, the speedup for the MM is now above 1000 for the largest grid size, and for the BR model speedup exceeds 2000 for a grid size of \(2^{20}\). However, performance for the BR model is more variable, with a dropoff in speedup at the largest grid size in contrast to monotonic increases with grid size in all other cases. In addition, WebGL performance for the smallest grid size is typically no greater than that seen in OpenACC and MATLAB with CUDA kernels. That is due to the fact that there is a minimum overhead time to launch the WebGL applications in each time step. The performance drop in the BR model for larger grid sizes could be due to memory access bottlenecks and how the data is stored on the GPU. It should be noted that even with the performance drops, the WebGL applications outperform all other implementations by a large margin for larger domains.

3.6 NVIDIA CUDA Implementation

One of the most popular platforms to solve PDEs in parallel using GPUs is CUDA. CUDA is a parallel platform developed by NVIDIA that allows the user to execute programs on the GPU of a personal computer. This allows faster processing and visualization of large data sets that fulfill certain characteristics that will be discussed below. Since its launch in 2007, CUDA has helped to extend the use of GPU technology to the scientific community. Specifically, the CUDA platform has been applied in several scientific and engineering fields such as fluid dynamics [133, 134], machine learning and neural networks [135,136,137,138], astrophysics [139,140,141,142], the Lattice Boltzmann method [143,144,145], molecular dynamics [146,147,148], clinical applications [149, 150], and recently in the cardiac modeling community [151,152,153,154,155]. CUDA has also been successfully used for teaching purposes, including in undergraduate workshops [156]. Like all computational tools, it has advantages and disadvantages.

The CUDA platform is an extension of other programming languages, i.e., it is a set of functions added to a preexisting platform that allows the user to communicate with the GPU. This implies that most of the base language characteristics and logic will be inherited by the parallel functions. There are several versions of CUDA, mainly C/CUDA (CUDA for the C language), PyCUDA (CUDA for Python) and CUDA Fortran. We decided to develop our solvers in C/CUDA because it is the most supported version. The description below is valid regardless of the version chosen.

In some cases, CUDA is able to launch millions of processing threads simultaneously, which can increase the speed of computations and save many hours and possibly days of processing time. The speedup depends mainly on the type of algorithm implemented to process the data and the structure of the data. To understand better how these factors affect the speed of the computations, it is important to understand the interaction between the software and hardware, particularly the interaction between the CPU (commonly referred to as the host) and the GPU (commonly referred to as the device). All programs start at the host level, meaning that they are all managed by the host and all the data is held in CPU RAM or on the hard drive. Meanwhile, sections of memory in the GPU are reserved to hold the data that needs to be processed. Once everything is ready, the CPU calls specific functions to be executed on the GPU. After the GPU processes the data, it must be sent back to the CPU so that it can be post-processed by the user. In most programs, there is a constant exchange of data between host and device. As a rule of thumb, the programmer should try to reduce the number of memory transactions between both ends, mainly due to bandwidth limitations. Other factors to be considered are the GPU’s memory capacity and the frequency of kernel calls (kernels are functions called by the CPU that execute on the GPU and hold the bulk of the processing algorithm). In addition to adequately controlling the data flow, the program must manage the data in a parallel-friendly arrangement; specifically, we must determine the way that data will be read and written. As commonly observed in programming languages with arrays of two or more dimensions, the data layout and memory access patterns need to be aligned to achieve maximum performance. More specifically, CUDA requires the data layout to adapt to a single-instruction, multiple-data processing structure, which means that all processing threads must be performing the same instructions simultaneously to avoid thread divergence.

In addition to the memory transactions and layout mentioned above, CUDA requires the programmer to adapt the data to a specific hierarchical structure of threads. In general, threads are organized into blocks, which can be one-, two-, or three-dimensional. Sets of blocks are then organized in a grid. Again, the grid can be arranged in all three dimensions. More information can be found in [157] and [158]. The dimension of these objects refers to how they will be accessed, not how they are physically stored in memory. The user can adapt this structure to increase the performance of their computations. In our particular case, two-dimensional blocks and grid resemble very well the 2D domains in which we are solving the PDEs. Still, different memory access patterns will influence the speed of our computations. Other factors to be considered when building a CUDA program are coalesced memory patterns, in which multiple threads can receive data through a single combined memory access, and the use of the various types of internal memories and the interaction among them. These are just some of many considerations that are important to keep in mind. It is also worth noting that if a task is not inherently parallelizable due to dependencies across loop iterations without substantial work within each iteration (such as a Fibonacci sequence calculation) or if the number of threads is small (typically on the order of hundreds or lower), CUDA will perform worse than most standard serial implementations due to overhead associated with launching kernels and moving data between the host and device.
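The thread hierarchy can be made concrete by emulating its index arithmetic in Python. In CUDA C, the global coordinate of a thread is typically computed as `blockIdx.x * blockDim.x + threadIdx.x` (and likewise in y); the sketch below reproduces that arithmetic on the CPU. The function names and the 16 × 16 block shape are assumptions made for illustration and are not necessarily the configuration used in this study.

```python
def grid_dims(nx, ny, block_x=16, block_y=16):
    """Number of blocks needed to cover an nx-by-ny domain (ceiling division)."""
    return ((nx + block_x - 1) // block_x, (ny + block_y - 1) // block_y)


def global_index(block_idx, thread_idx, block_dim):
    """Global (i, j) of a thread: blockIdx * blockDim + threadIdx per dimension."""
    i = block_idx[0] * block_dim[0] + thread_idx[0]
    j = block_idx[1] * block_dim[1] + thread_idx[1]
    return i, j


def covered_nodes(nx, ny, block_dim=(16, 16)):
    """Enumerate the nodes touched by all threads, skipping out-of-range
    threads just as a kernel would with an `if (i < nx && j < ny)` guard."""
    gx, gy = grid_dims(nx, ny, *block_dim)
    nodes = set()
    for by in range(gy):
        for bx in range(gx):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    i, j = global_index((bx, by), (tx, ty), block_dim)
                    if i < nx and j < ny:
                        nodes.add((i, j))
    return nodes


print(grid_dims(256, 256))         # (16, 16)
print(len(covered_nodes(40, 20)))  # 800
```

When the domain size is not a multiple of the block size, some threads in the last blocks fall outside the domain, which is why the range guard is needed.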

In our implementations, we used global memory. Both single- and double-precision implementations were tested, and the variations in performance were limited to less than \(10\%\) on this particular GPU. The results presented here were generated using double-precision variables. One-dimensional arrays were used to represent the 2D domain. The data can be arranged in these one-dimensional arrays in either a row-major or a column-major fashion. In the row-major structure, the matrix is stored in the 1D array one row after another until the entire matrix is stored. In the column-major order, the same procedure is followed for the matrix columns. Both versions were tested to observe the performance differences, and the row-major data structures performed consistently better than the column-major ones. A possible explanation is that the row-major structure is more compatible with the hardware, perhaps because of the way that warps are organized on this GPU. Different results might be expected for different GPUs, and users should be aware of such differences. This effect should not be confused with the loop-order effect seen when traversing multi-dimensional arrays on CPUs. Here, in both cases, the data structures are one-dimensional and the central difference algorithm for the diffusion term imposes a symmetry condition on both directions.
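
The two layouts differ only in how a 2D index \((i, j)\) is mapped to a position in the 1D array. The sketch below is an illustration under stated assumptions, not the code used in this study: the helper names, the 3 by 4 test matrix, and the unit grid spacing are hypothetical. It shows both index maps and checks that a five-point central-difference Laplacian gives the same value in either layout.

```python
def row_major(i, j, n_rows, n_cols):
    """Linear index when rows are stored one after another."""
    return i * n_cols + j

def col_major(i, j, n_rows, n_cols):
    """Linear index when columns are stored one after another."""
    return j * n_rows + i

def flatten(matrix, index):
    """Copy a list-of-lists matrix into a 1D list using the given index map."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    flat = [0.0] * (n_rows * n_cols)
    for i in range(n_rows):
        for j in range(n_cols):
            flat[index(i, j, n_rows, n_cols)] = matrix[i][j]
    return flat

def laplacian_at(u, i, j, n_rows, n_cols, index, h=1.0):
    """Five-point central-difference Laplacian at an interior point (i, j)."""
    center = u[index(i, j, n_rows, n_cols)]
    north = u[index(i - 1, j, n_rows, n_cols)]
    south = u[index(i + 1, j, n_rows, n_cols)]
    west = u[index(i, j - 1, n_rows, n_cols)]
    east = u[index(i, j + 1, n_rows, n_cols)]
    return (north + south + west + east - 4.0 * center) / (h * h)

# Hypothetical test field u(i, j) = i^2 + 2j, whose discrete Laplacian is 2.
n_rows, n_cols = 3, 4
matrix = [[float(i * i + 2 * j) for j in range(n_cols)] for i in range(n_rows)]
u_row = flatten(matrix, row_major)
u_col = flatten(matrix, col_major)
lap_row = laplacian_at(u_row, 1, 1, n_rows, n_cols, row_major)
lap_col = laplacian_at(u_col, 1, 1, n_rows, n_cols, col_major)
print(lap_row, lap_col)
```

The numerical result is identical; what changes is the distance in memory between neighbors. In the row-major map, stepping in \(j\) moves by one element, whereas in the column-major map it moves by `n_rows` elements, so which layout is faster depends on which thread index is paired with which array index.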

We also decided against using shared-memory implementations such as those suggested in earlier studies [152]. The use of shared memory requires copying the variables from global memory into shared memory, performing the calculations on shared-memory data in registers (implicitly), writing the results back into shared memory, and finally writing them to global memory. These steps are required in every time step because no data can be retained between time steps. In contrast, the use of global memory alone requires only bringing the data into registers for the calculations and writing the results back to global memory. For this type of problem, the global-memory approach therefore involves fewer memory transactions than the shared-memory implementations, with the same number of global memory accesses, and is thus expected to be faster. Additionally, our goal in this study is to compare the simplest implementations in each language, as the targeted programmers are scientists whose primary expertise is not GPU programming. The use of texture memory instead of one-dimensional arrays could also change the performance of the applications. However, any performance improvements could be hardware-dependent and would also depend on the problem size and complexity. In favor of simplicity, we chose a global-memory implementation.
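
The accounting in this argument can be written out explicitly. The sketch below simply encodes the data-movement steps listed in the text for one value in one time step and counts them; the names `SHARED_PATH`, `GLOBAL_PATH`, and `count_moves` are hypothetical, and the sketch ignores hardware caches and any reuse of shared data among neighboring threads.

```python
# Data-movement steps for one value in one time step, as described in the text.
SHARED_PATH = [
    ("global", "shared"),    # stage the data in shared memory
    ("shared", "register"),  # load operands for the calculation
    ("register", "shared"),  # store the result in shared memory
    ("shared", "global"),    # write the result back to global memory
]
GLOBAL_PATH = [
    ("global", "register"),  # load operands directly from global memory
    ("register", "global"),  # write the result directly to global memory
]

def count_moves(path):
    """Return (total moves, moves that touch global memory) for a path."""
    total = len(path)
    global_accesses = sum(1 for src, dst in path if "global" in (src, dst))
    return total, global_accesses

shared_counts = count_moves(SHARED_PATH)
global_counts = count_moves(GLOBAL_PATH)
print(shared_counts, global_counts)
```

Under this simplified accounting, both paths touch global memory twice, while the shared-memory path adds two extra moves per value.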

We used \(16\times 16\) threads per block for the CUDA implementations, and the number of grid points in each direction was then divided by 16 to obtain the number of blocks in that direction. Smaller blocks led to lower performance, and larger blocks did not improve the performance on our particular GPU.
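
The launch geometry described here amounts to a division of the domain size by 16 in each direction. The following sketch is hypothetical and not taken from the study's code: it uses a ceiling division so that domain sizes that are not multiples of 16 are also covered, which is an assumption beyond what the text states, and it reports how many launched threads would then fall outside the domain.

```python
def launch_geometry(nx, ny, threads=16):
    """Blocks per direction for threads-by-threads blocks on an nx-by-ny domain.

    Returns ((blocks_x, blocks_y), launched_threads, idle_threads), where
    idle threads are those whose global index lies outside the domain.
    """
    blocks_x = (nx + threads - 1) // threads   # ceiling division
    blocks_y = (ny + threads - 1) // threads
    launched = blocks_x * blocks_y * threads * threads
    idle = launched - nx * ny
    return (blocks_x, blocks_y), launched, idle

# Hypothetical domain sizes: one multiple of 16 and one that is not.
exact = launch_geometry(512, 512)
padded = launch_geometry(1000, 1000)
print(exact, padded)
```

When the domain size is a multiple of 16, no threads are idle; otherwise a small fraction of the launched threads does no work and must be excluded by a bounds check in the kernel.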

Fig. 7. Speedup versus grid size for the FHN, MM, and BR models using CUDA, with column-major data structures shown as dashed lines and row-major data structures as solid lines.

Figure 7 shows the speedup achieved for each of the three models using our CUDA implementation. CUDA slightly outperforms MATLAB with CUDA kernels and OpenACC, especially for smaller grids, but overall the performance of these three implementations is fairly comparable. WebGL maintains better performance for all grid sizes. Note that the speedup seems to have saturated for the FHN model and appears to be close to saturating for the other models. For larger grids, we could therefore observe a performance saturation similar to that observed in the WebGL implementations. Moreover, because GPU memory is limited, there is a limit to the problem size that can be handled on a single GPU, so that using multiple GPUs becomes inevitable for larger problems. While multiple GPUs make it possible to handle larger domains despite memory constraints, it should be noted that the required communication between the GPUs will impose performance penalties on the parallel GPU codes.
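
A rough estimate of the single-GPU size limit follows from the storage needed per grid point. The sketch below is a back-of-the-envelope illustration under stated assumptions, not a measurement from this study: the function names, the 8 GiB memory figure, the number of state variables, and the use of two buffers per variable (current and next time step) are all hypothetical, and memory used by the driver, the display, and other data is ignored.

```python
import math

def state_bytes(nx, ny, n_vars, bytes_per_value=8, n_buffers=2):
    """Bytes needed to hold n_vars double-precision fields on an nx-by-ny grid."""
    return nx * ny * n_vars * bytes_per_value * n_buffers

def max_square_grid(mem_bytes, n_vars, bytes_per_value=8, n_buffers=2):
    """Largest N such that an N-by-N grid fits within mem_bytes."""
    per_point = n_vars * bytes_per_value * n_buffers
    return math.isqrt(mem_bytes // per_point)

# Hypothetical two-variable model on a 1024-by-1024 grid.
small_model_bytes = state_bytes(1024, 1024, n_vars=2)

# Hypothetical eight-variable model on a GPU with 8 GiB of memory.
largest_n = max_square_grid(8 * 1024**3, n_vars=8)
print(small_model_bytes, largest_n)
```

Because the storage grows with the number of state variables, more detailed cell models reach the single-GPU limit at smaller grid sizes than simple models do.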

Fig. 8. Speedup comparison for various implementations of the FHN, MM, and BR models. Each color corresponds to a different grid size.

4 Discussion and Conclusion

Figure 8 compares the speedups obtained with each of the GPU implementations of the three models for different grid sizes. The WebGL applications outperform all other implementations in all cases except for the smallest grid sizes and the FHN model. As soon as the workload on the GPU is “large” enough to take full advantage of concurrency, WebGL provides the best performance. All implementations performed better with larger grid sizes and more complicated models, with the BR model implementations providing the best performance among all models. Another notable observation is that, with the exception of TensorFlow, almost all GPU implementations provided performance comparable to that of the NVIDIA CUDA implementations, with only minor differences. Therefore, we can conclude that almost all languages considered in this study are ready to make effective use of GPU hardware to reduce program runtimes. Achieving parallelism required the least effort in TensorFlow, followed by C-OpenACC, MATLAB arrayfun, and then Python Numba. However, writing the serial code in TensorFlow was the most convoluted of all the approaches tested. Nevertheless, moving from the serial code to the accelerated GPU code was as simple as choosing the target device. C-OpenACC was the most natural for a novice programmer and could provide the best performance with the least programming effort. MATLAB and Python Numba, however, provide built-in visualization tools.