1 Introduction

Mathematical modeling and computer simulation are now focused primarily on nonlinear dissipative phenomena in chemistry and biology [1], rather than in physics. Meanwhile, conventional mathematical models based on differential calculus are sometimes not capable of simulating nonlinear dissipative processes at the micro- or nano-level of resolution, which is required in many chemical, biological, and microelectronic investigations. A large class of such tasks deals with “reaction–diffusion” processes, and accordingly they are represented by “reaction–diffusion” (RD) models. In continuous mathematics such models are given by parabolic partial differential equations, usually with nonlinear reaction terms. Numerical solution of such equations requires transformations to obtain a linear approximation and to overcome difficulties in parallel implementation. Problems arise from the incompatibility between continuous mathematical models intended for sequential solution, on the one hand, and the discrete data and parallel operation of modern computers on the other [2]. These problems stimulate the development of new approaches to spatial dynamics modeling [1], which are based on the following principles:

  • time, space, and variables should be integers (including Booleans), real numbers being allowed only for computing auxiliary values (probabilities, conditions);

  • all time-independent operations should be explicitly expressed in the model;

  • the computer simulation algorithm should allow visualization of the resulting spatial function in real time.

These principles form the background of many modern mathematical models for computer simulation of reaction–diffusion phenomena in quite different fields of science. The most advanced are devoted to the investigation of pattern formation [3], surface catalytic reactions [4] and microelectronics [5, 6], where they are called “Kinetic Monte Carlo methods”, according to their origin in stochastic computation theory.

Such models appear as kinetic Monte Carlo methods in chemistry [4, 7] and in microelectronics [5], probabilistic cellular automata in materials science [3, 6, 8], and self-organizing CA in biology [9, 10]. Actually, all of them may be considered as asynchronous cellular automata (ACA) [11], simulating phenomena composed of movements and transformations [12] of certain kinds of micro-entities, such as molecules or nanoparticles, either real or abstract. Accordingly, the model should consist of a number of simple local operators applied to randomly chosen sites of the discrete space. The organization of particle interactions in space and time (the mode of operation) is not necessarily completely asynchronous.

Whereas there exists some experience in ACA simulation [12–14], there is no systematic methodology for synthesizing a model of a complex phenomenon given a description of the involved elementary actions and their interactions. The main problem is constructing, from given elementary local functions, a global operator that provides the best adequacy to the process under simulation on one hand, and satisfies computational correctness and efficiency conditions on the other. Since ACA simulation usually aims to explore a process which is not yet known in detail, the choice of the local operator composition is sometimes made ad hoc. So, it is worth knowing how the resulting ACA evolution depends on the chosen composition mode, in order to make model synthesis more effective.

Moreover, investigation of chemical and biological processes usually aims at studying mechanisms of the phenomena at micro (or nano) level [15]. So, parallel implementation conditions should be taken into account.

It means that asynchronous behavior should be partially synchronized, but in such a manner that does not distort the process. The synchronization is induced [16] by transforming the asynchronous mode into an equivalent block-synchronous one. In [12] it is shown, on a real-life example of heterogeneous catalysis simulation, that such a transformation induces no errors in asynchronous CA behavior. This fact provokes the following question: how much synchronization is admissible in reaction–diffusion CA? Just this question is answered in the investigation reported in this paper. Based on the available experience, the main modes of operation are distinguished. For each of them the amount of additional computation needed for parallel computing and the amount of data exchange are assessed, and computer simulation of a typical reaction–diffusion process is performed. The simulation program was implemented on the cluster NKS-30 of the Siberian Supercomputer Center of the Siberian Branch of the Russian Academy of Sciences. Results of its parallel implementation on a multiprocessor with distributed memory are presented.

2 Formal representation of cellular automata models

Since the investigation aims to create methods for efficient simulation of spatial dynamics, and the final product should be a parallel computer program, the formalism used to represent cellular automata models is a rule-based system called the parallel substitution algorithm [17]. This form of model representation corresponds to the CA concept and is easily transformed into an imperative programming language when implemented on a conventional computer.

2.1 Definition of cellular automata concepts

A CA is defined by a triple \(\aleph =\langle A,X,\varTheta (X) \rangle \), where A is a set of symbols, called the alphabet, which may be interpreted as designations of substances involved in the process under simulation. \(X=\{\mathbf {x}\}\) is a set of cell names, usually given by a set of coordinates of a finite discrete space with \( \mathbf {x} = (i,j,k), i=0,\ldots ,I, j=0,\ldots ,J, k=0,\ldots ,K\). A pair \((u,\mathbf {x})\), \(u \in A\), \(\mathbf {x}\in X\), is called a cell. The set of cells \(\varOmega =\{(u_k,\mathbf {x}_k)| \mathbf {x}_k\ne \mathbf {x}_l\}\) forms a cellular array \(\varOmega _A=(u_1(\mathbf {x_1}),u_2(\mathbf {x_2}), \ldots ,u_{M}(\mathbf {x}_{M}))\), \(M=|X|\), referred to as a global state. \(\varTheta (X)\) is a global operator, whose application to \(\varOmega (t)\) transforms it into \(\varOmega (t+1)\), the transformation being considered as an iteration. The sequence of global states, \(\varSigma (\varOmega _A(0)) =\varOmega _A(0), \ldots, \varOmega _A(t), \ldots, \varOmega _A(\hat{t}) \), obtained by iterative application of \(\varTheta (X)\) to \(\varOmega _A(t)\), is called a CA evolution, \(\hat{t}\) being the final iteration.

An application of \(\varTheta (X)\) is the result of performing local operations at all cells of the array. A local operation is given as a substitution of the form

$$\begin{aligned} \theta (\mathbf {x}):S(\mathbf {x}) \mathop {\longrightarrow }\limits ^{p}S^\prime (\mathbf {x}), \end{aligned}$$
(1)

where \( S(\mathbf {x})= (u_0, \ldots , u_{n-1}) \) and \(S^\prime (\mathbf {x})= (v_0, \ldots , v_{m-1}) \), \(u,v \in A\), \(m \le n\), are local configurations, expressed as vectors whose components are states of cells allocated in the corresponding underlying templates: \( T(\mathbf {x})\) and \(T^\prime (\mathbf {x})\), such that

$$\begin{aligned} T(\mathbf {x})=\{\mathbf {x},\mathbf {x}+\mathbf {a}_1,\ldots , \mathbf {x}+\mathbf {a}_{n-1}\},\quad T^\prime (\mathbf {x})=\{\mathbf {x},\mathbf {x}+\mathbf {a}_1,\ldots , \mathbf {x}+\mathbf {a}_{m-1}\}, \quad m\le n, \end{aligned}$$
(2)

where \(\mathbf {a}_j\) is a shift vector such that the cell \(\mathbf {x}+\mathbf {a}_j \in T(\mathbf {x})\) has the state \(u_j\), and the cell \(\mathbf {x}+\mathbf {a}_j \in T^\prime (\mathbf {x})\) has the state \(v_j\). The maximal component of \(\mathbf {a}_j\) is called the template radius R(T).

Application of \( \theta (\mathbf {x}) \) to a certain \(\mathbf {x_k} \in X\) replaces the states \(\{u_0,\ldots , u_j ,\ldots , u_{m-1}\}\) in \(S(\mathbf {x_k})\) by the states

$$\begin{aligned} v_j=f_{j}(u_0,\ldots , u_j, \ldots ,u_{n-1}), \quad j=0,\ldots ,m-1, \quad n=|S(\mathbf {x_k})|, \end{aligned}$$
(3)

which are the values of the transition functions \(f_{j}(u_0,\ldots , u_{n-1})\); p is the probability of \(\theta (\mathbf {x})\) application.

To provide correct functioning, \(\varTheta (X)\) must satisfy the following condition: no two substitution applications should update the same cell at the same time, i.e.,

$$\begin{aligned} T^{\prime }(\mathbf {x}) \cap T^\prime (\mathbf {y}) = \emptyset \quad \forall \mathbf {x},\mathbf {y} \in X,\ \mathbf {x}\ne \mathbf {y}. \end{aligned}$$
(4)

If the global operator \(\varTheta (X)\) deals with a single substitution \(\theta (\mathbf {x})\), it is called a simple global operator. Accordingly, a CA with a simple \(\varTheta (X)\) is a simple CA. An iteration in a simple CA consists of applying \(\theta (\mathbf {x})\) to all \(\mathbf {x} \in X\), the order of cells being defined by the mode of operation \(\rho \in \{\sigma , \alpha ,\beta \}\). The following modes of operation are used when the CA model is implemented on a single conventional computer or on many of them working in parallel.

Synchronous mode, \(\rho =\sigma \), prescribes the following algorithm of applying \(\varTheta (X)\) to \(\varOmega (t)\):

1. for all cells \(\mathbf {x} \in X\), chosen in any order, the next state \(v(\mathbf {x})\) is computed by (3), and

2. all cells are adjusted at once transforming \(\varOmega (t)\) into \(\varOmega (t+1)\).
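The two-phase synchronous update can be sketched in Python (a minimal 1D sketch; the transition function `f_copy_left` is an illustrative example, not one from the paper):

```python
def synchronous_step(cells, f):
    """Synchronous mode: the next state of every cell is computed from
    the current global state Omega(t) first, and only then are all
    cells adjusted at once, yielding Omega(t+1)."""
    # Phase 1: next states are computed from Omega(t) only.
    nxt = [f(cells, x) for x in range(len(cells))]
    # Phase 2: all cells are adjusted simultaneously.
    return nxt

def f_copy_left(cells, x):
    # Illustrative transition function: each cell copies the state of its
    # left neighbor (periodic boundary), so the pattern shifts right.
    return cells[(x - 1) % len(cells)]

print(synchronous_step([1, 0, 0, 0], f_copy_left))  # -> [0, 1, 0, 0]
```

Because all next states are read from the unchanged array, the result does not depend on the order in which cells are visited in phase 1.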

In a synchronous CA the correctness condition (4) is satisfied when \(|T_k^\prime (\mathbf {x})| \le 1\). This is a strong constraint in CA model synthesis, but it simplifies parallel implementation of a large-scale CA on a cluster. Among the synchronous CA sometimes used as components in reaction–diffusion simulation are adsorption (desorption) CA, phase-transition CA (evaporation, freezing), etc. For example, the “sticking process”, which makes a mobile particle (\(u=1\)) become immobile (\(u=b\)) and stick to another immobile one, is represented by the probabilistic substitution

$$\begin{aligned} \theta _{st} (\mathbf {x}): \{ (1,\mathbf {x}),(b,\mathbf {x}+\mathbf {a}_l )\} \mathop {\longrightarrow }\limits ^{p_{st}}\{(b,\mathbf {x})\}, \end{aligned}$$
(5)

where \(\mathbf {x}+\mathbf {a}_l \in T(\mathbf {x})\) and \( p_{st}\) is the probability of \(\theta _{st} (\mathbf {x})\) application.

Asynchronous mode, \(\rho =\alpha \), executes \(\varTheta (X)\) as follows:

1. a cell \(\mathbf {x} \in X\) is chosen from \(\varOmega (t)\) with probability \(p=1/|X|\),

2. the next states of the cells \(\mathbf {x}\in T^{\prime }(\mathbf {x})\) (1) are computed and adjusted immediately,

3. steps 1 and 2 are executed |X| times, completing the t-th iteration.

Most CA simulating particle movements, such as convection and diffusion processes, as well as chemical reactions between molecules in adjacent cells, are asynchronous. A typical one is the so-called “naive diffusion”, an essential component of any reaction–diffusion model [18, 19], which performs an exchange of states between \(\mathbf {x}\) and a randomly chosen cell from its neighborhood, as follows:

$$\begin{aligned} \begin{array}{ll} \theta _d(\mathbf {x}):&{} \{(u_0,\mathbf {x}+\mathbf {a}_0),\ldots ,(u_l,\mathbf {x}+\mathbf {a}_l), \ldots , (u_{2d},\mathbf {x}+\mathbf{a}_{2d})\}\mathop {\longrightarrow }\limits ^{p_d}\\ &{} \{(u_l,\mathbf {x}+\mathbf {a}_0),\ldots ,(u_0,\mathbf {x}+\mathbf {a}_l), \ldots , (u_{2d},\mathbf {x}+\mathbf{a}_{2d})\},\\ \end{array} \end{aligned}$$
(6)

where \(l\in \{1,\ldots ,2d\}\), d being the space dimension, is a randomly chosen number, \(\mathbf {a}_l\) are the orts shifting \(\mathbf {x}\) to its lth neighbor, and \(p_d\) is the probability of \(\theta _d(\mathbf {x})\) application.
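A minimal Python sketch of one asynchronous iteration of (6) on a 2D array; the periodic boundary treatment is an assumption of the sketch, not specified in the text:

```python
import random

def naive_diffusion_iteration(grid, p_d=1.0, rng=random):
    """One iteration of asynchronous 'naive diffusion' (substitution (6))
    on a 2D array: |X| times a cell is chosen at random and, with
    probability p_d, exchanges its state with a randomly chosen von
    Neumann neighbor (periodic boundaries are an assumption here)."""
    I, J = len(grid), len(grid[0])
    for _ in range(I * J):
        i, j = rng.randrange(I), rng.randrange(J)                # random cell x
        di, dj = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])  # ort a_l
        ni, nj = (i + di) % I, (j + dj) % J
        if rng.random() < p_d:
            # exchange of states, adjusted immediately (asynchronous mode)
            grid[i][j], grid[ni][nj] = grid[ni][nj], grid[i][j]
    return grid
```

Since \(\theta _d\) only exchanges states, the number of particles is conserved, which is a convenient sanity check on an implementation.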

In reality, an asynchronous computation process is not divided into iterations. The concept of iteration is accepted conventionally to make the comparison with the synchronous case more meaningful. Unlike the synchronous case, asynchronous computation on one processor is always correct due to its serial character, but its implementation on a parallel computer is highly inefficient.

Block-synchronous mode, \(\rho = \beta \), combines synchronous and asynchronous modes in a single algorithm of global operator execution by representing the iteration as a number of sequential synchronous stages executed in random order. The \(\beta \)-mode was introduced in [16] to make asynchronous CA parallelization more acceptable, relying on the fact that synchronous CA parallelization is efficient, in contrast to the asynchronous case.

The block-synchronous mode is based on a partition \(\varPi =\{\varPi _1, \ldots , \varPi _m\}\) of X, as follows.

  1.

    A block \(B(\mathbf {x})\) containing m adjacent cells is chosen, such that

    $$\begin{aligned} |B(\mathbf {x})|=m, \quad B(\mathbf {x}) \supseteq T^{\prime }(\mathbf {x}). \end{aligned}$$
    (7)
  2.

    A partition \(\varGamma =\{B(\mathbf {x}_1), \ldots , B(\mathbf {x}_M)\}\) is constructed having \(M=N/m\) congruent blocks \(B(\mathbf {x}_i)\), such that

    $$\begin{aligned} \forall (i \ne j): B(\mathbf {x}_i) \cap B(\mathbf {x}_j) =\emptyset , \quad \quad \bigcup _{i=1}^M{B(\mathbf {x}_i)}=X. \end{aligned}$$
    (8)
  3.

    A partition \(\varPi =\{\varPi _1, \ldots , \varPi _m\} \) orthogonal to \(\varGamma \) is determined, whose subsets, being processed synchronously in any order, execute the computation of \(\varTheta (X)\) correctly.

Correctness of the \(\beta \)-mode results from the following property of the orthogonal partitions \(\varGamma \) and \(\varPi \):

$$\begin{aligned} |\varPi _k \cap B_j| =1, \quad k=1,\dots , m, \quad j=1,\ldots ,M, \end{aligned}$$
(9)

which provides that any pair of cells \((\mathbf {x}_i,\mathbf {x}_j )\) belonging to one and the same \(\varPi _k\) is, according to (7), separated by a distance \(d_{ij}>R (T)\), hence allowing the cells to be updated simultaneously, i.e., synchronously, without violating (4).
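The construction of \(\varPi \) can be sketched for the 2D case as follows; assigning every in-block position to its own stage is one standard way, assumed here, to obtain a partition orthogonal to \(\varGamma \) (the sketch also assumes array sides divisible by the block side):

```python
def stages(I, J, R):
    """Build the partition Pi orthogonal to the block partition Gamma
    for a 2D array of I x J cells and block radius R (block side 2R+1).
    Stage Pi_k collects the same fixed position out of every block, so
    any two cells of one stage are more than 2R apart and may be
    updated synchronously without violating (4)."""
    b = 2 * R + 1                      # block side; m = b * b cells per block
    pi = {}
    for i in range(I):
        for j in range(J):
            k = (i % b) * b + (j % b)  # position of (i, j) inside its block
            pi.setdefault(k, []).append((i, j))
    return [pi[k] for k in sorted(pi)]

# Property (9): every stage contains exactly one cell of every block.
Pi = stages(6, 6, 1)                   # 36 cells, 4 blocks of m = 9 cells
assert len(Pi) == 9                    # m stages
for stage in Pi:
    blocks = {(i // 3, j // 3) for (i, j) in stage}
    assert len(stage) == len(blocks) == 4
```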

Investigation of the computational properties of block-synchronous CA [2, 13] showed that they are significantly less time consuming than asynchronous ones, which encourages their use in sequential cases as well. The gain in time is due to the difference in the number of random generator (RG) calls for choosing the cells of \(\theta \) application, which in the \(\alpha \)-case and the \(\beta \)-case are as follows:

$$\begin{aligned} E_{\mathrm{RG},\alpha } =N, \quad E_{\mathrm{RG},\beta }=m. \end{aligned}$$
(10)

Taking into account that the time of an RG call may be about two orders of magnitude larger than that of a substitution application, the economy of time with the \(\beta \)-mode may be significant.

2.2 CA models of complex phenomena

The concept of a simple CA is too limited to be used for simulating complex phenomena, particularly reaction–diffusion processes, for the following reasons.

  1.

    In complex processes several species are involved. Hence, the CA alphabet (usually a symbolic one) should be expanded, and the global operator \(\varTheta (X)\) should be composed of several substitutions, referred to as a \(\theta \)-composition. Accordingly, a CA with a complex \(\varTheta (X)\) is called a complex CA. Reaction–diffusion CA are always complex, containing at least two substitutions: \(\theta _\mathrm{reac}(\mathbf {x})\) and \(\theta _\mathrm{diff}(\mathbf {x})\).

  2.

    The behavior of natural phenomena is usually completely or partly stochastic, both in time and in space. In fact, it is not known exactly how the process of particle interactions really proceeds. Hence, when designing the CA model a researcher should have the opportunity to use different types of \(\theta \)-composition and choose the most plausible one.

The formal representation of a complex CA is similar to that of a simple one, differing in the interpretation of two concepts of the CA definition: the alphabet and the global mode of operation. The latter is defined not only by \(\rho \in \{\alpha , \beta , \sigma \}\), determining the way of choosing cells to be updated, but also by the type of \(\theta \)-composition, indexed by \(\mu \in \{\delta ,\gamma \}\), which indicates the way of choosing the \(\theta (\mathbf {x}) \in \varTheta \) to be applied, either deterministically (\(\delta \)) or randomly (\(\gamma \)):

$$\begin{aligned} \varTheta (X)=\varPhi _{\rho ,\mu }(\theta _1(\mathbf {x}), \ldots , \theta _n(\mathbf {x})). \end{aligned}$$
(11)

No matter which global mode of operation is used, an iteration consists of

$$\begin{aligned} E=|X| \times |\varTheta |=N\cdot n \end{aligned}$$
(12)

acts of \(\theta \)-application.

The following types of global mode of operation are mainly used.

Asynchronous stochastic mode \(\varPhi _{\alpha ,\gamma }\) is an idealization of a maximally randomized representation of particle behavior at the micro-level. In chemistry [4–6] such a model is called the “Monte Carlo simulation method”; in [12] this model forms a class of stochastic CA. The global operator of such a CA is computed as follows:

  1.

    a cell \(\mathbf {x} \in X\) is chosen with the probability \(p=1/|X|\),

  2.

    a substitution \(\theta _j \in \varTheta \), \(j=1,\ldots ,n\), is chosen randomly according to the probabilities \(p_j\), obtained from the \(\theta \)-intensities \( r_1, \ldots ,r_n\):

    $$\begin{aligned} p_j=\frac{r_j}{r_1+r_2+\cdots +r_n}, \end{aligned}$$
    (13)

    and applied to \(\mathbf {x}\) immediately,

  3.

    steps 1 and 2 are repeated E times.
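These three steps can be sketched in Python; the local operators passed in `thetas` are placeholders for substitutions such as (5) or (6), and the in-place calling convention is an assumption of the sketch:

```python
import random

def phi_alpha_gamma(cells, thetas, rates, rng=random):
    """One iteration of the asynchronous stochastic mode Phi_{alpha,gamma}:
    E = |X| * n times a cell is chosen at random, one substitution theta_j
    is drawn with probability p_j = r_j / (r_1 + ... + r_n) (formula (13)),
    and applied to the chosen cell immediately."""
    N, n = len(cells), len(thetas)
    total = sum(rates)
    for _ in range(N * n):                  # E = N * n acts per iteration
        x = rng.randrange(N)                # step 1: random cell
        u, acc = rng.random() * total, 0.0  # step 2: random theta_j by (13)
        for theta, r in zip(thetas, rates):
            acc += r
            if u < acc:
                theta(cells, x, rng)        # applied immediately
                break
    return cells
```

Drawing \(\theta _j\) by accumulating the intensities is the standard inverse-transform way of realizing (13) with a single random number per act.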

Asynchronous deterministic mode \(\varPhi _{\alpha ,\delta }\) has its global operator computed as follows:

  1.

    a cell \(\mathbf {x} \in X\) is chosen with the probability \(p=1/|X|\),

  2.

    the substitutions \(\theta _j \in \varTheta \), \(j=1,\ldots ,n\), are applied to \(\mathbf {x}\) in a prescribed deterministic order.

Block-synchronous stochastic mode \(\varPhi _{\beta ,\gamma }\) is used in large-scale CA to be implemented on a supercomputing cluster, or, sometimes, when the computing time needs to be reduced. Computation of \(\varTheta (X)\) is performed as follows.

  1.

    A block \(B(\mathbf {x})\) containing \(m_{\beta }\) cells is chosen, such that

    $$\begin{aligned} B(\mathbf {x}) \supseteq T^{\prime }_{\varSigma }(\mathbf {x}), \quad T^{\prime }_{\varSigma }(\mathbf {x}) = \bigcup _{j=1}^n{T_j^{\prime }(\mathbf {x})}, \end{aligned}$$
    (14)

    where \(T_j^{\prime }(\mathbf {x})\) is the underlying template of \(\theta _j \in \varTheta \), and the size of \(B(\mathbf {x})\) is conveniently represented by its radius \(R_{\beta }\), such that

    $$\begin{aligned} m_{\beta }=|B(\mathbf {x})|=(2R_{\beta }+1)^D, \quad m_{\beta } \ge |T^{\prime }_{\varSigma }(\mathbf {x})|. \end{aligned}$$
    (15)
  2.

    The partition \(\varGamma =\{B(\mathbf {x}_1), \ldots, B(\mathbf {x}_{M_{\beta }})\}\) and the partition \(\varPi =\{\varPi _1, \ldots , \varPi _{m_{\beta }}\} \) orthogonal to it are constructed according to (8).

  3.

    At each subset \(\varPi _i\), \(i=1,\ldots ,m_{\beta }\), for all cells \(\mathbf {x} \in \varPi _i\) the following is done n times: a substitution is chosen according to the probabilities (13) and immediately executed.

Sequential global mode \(\varPhi _{\rho ,\delta }\) is a superposition of simple global operators \(\varTheta _j(X)\), \(j=1,\ldots ,n\), each operating in its own mode \(\rho _j\), but mostly used with \(\rho _j=\sigma \) for all \( \varTheta _j(X)\), i.e.

$$\begin{aligned} \varTheta (X) = \varTheta _{n}(\varTheta _{n-1} (\ldots \varTheta _1(X)\ldots )). \end{aligned}$$
(16)

2.3 The concept of CA stochasticity

For investigating the interdependence between the stochastic character of CA behavior and its simulation performance on supercomputers, a quantitative measure of CA stochasticity is essential. Let the number of random operations in an iteration be denoted by \(E_\mathrm{rand}\) and the number of deterministic ones by \(E_\mathrm{det}=E-E_\mathrm{rand}\). An operation is considered random if its application is determined by a random number. So, the asynchronous choice of a cell \(\mathbf {x} \in X\) is one random operation, no matter how many random numbers are used for obtaining \(\mathbf {x}\) (since \(\mathbf {x}\) may be represented by two or three coordinates in a discrete space). When the \(\theta _j(\mathbf {x}) \in \varTheta \), \( j=1,\ldots ,n\), to be applied to \(\mathbf {x}\) is chosen randomly according to (13), its application belongs to \(E_\mathrm{rand}\). But the deterministic application of a probabilistic \(\theta (\mathbf {x})\) is not considered random, although it requires the use of an RNG.

Stochasticity \(\lambda \) of a CA is the fraction of random operations in a computation of \(\varTheta (X)\):

$$\begin{aligned} \lambda =E_\mathrm{rand}/E. \end{aligned}$$
(17)

Stochasticity depends both on the mode of CA operation (\(\rho \)) and on the type of \(\theta \)-application mode (\(\mu \)); the first, being based on spatial randomness, is referred to as spatial stochasticity, while the second, based on random application order, is temporal stochasticity.

From the definitions of the global operator modes given in Sect. 2.2, \(E_\mathrm{rand} \) is easily obtained.

  • For \(\varPhi _{\alpha ,\gamma }\): each cell \(\mathbf {x}\in X\) is randomly chosen \(n=|\varTheta |\) times, each time being adjusted by a randomly chosen \(\theta (\mathbf {x})\). So, \(E_\mathrm{rand}=N\cdot n = E\).

  • For \(\varPhi _{\alpha ,\delta }\): to each randomly chosen cell \(\mathbf {x}\in X\), all substitutions \(\theta (\mathbf {x})\) are applied deterministically in a prescribed order. Hence \(E_\mathrm{rand} =N\).

  • For \(\varPhi _{\beta ,\gamma }\): a stage \(\varPi _j \in \varPi \), \(j=1,\ldots ,m\), is randomly chosen; for each \(\mathbf {x}\in \varPi _j\), chosen deterministically in any order, the following is done \(n=|\varTheta (X)|\) times: a substitution \(\theta _i \in \varTheta \), \(i=1,\ldots ,n\), is chosen with the probability calculated according to (13) and applied to \(\mathbf {x}\). This is repeated for all \(m_{\beta }\) stages. So, \(E_\mathrm{rand}=m\cdot n\).

  • For \(\varPhi _{\beta ,\delta }\): a stage \(\varPi _j \in \varPi \), \(j=1,\ldots ,m\), is randomly chosen; to each \(\mathbf {x}\in \varPi _{j}\), chosen deterministically, all \(\theta \in \varTheta \) are applied in any order. So, \(E_\mathrm{rand}=m\).

  • For the sequential global mode \(\varPhi _{\rho ,\delta }\): \( E_\mathrm{rand}\) may be calculated as the sum of random operations over all \(\varTheta _j(X)\), \(j=1,\ldots ,n\). If all \(\varTheta _j(X)\) are synchronous, \(E_\mathrm{rand}=0\).

Hence, the stochasticity of all types of complex CA is as follows:

$$\begin{aligned} \lambda _{\alpha ,\gamma }=1, \quad \lambda _{\alpha ,\delta }=1/n, \quad \lambda _{\beta ,\gamma }=m_{\beta }/N, \quad \lambda _{\beta ,\delta }=m_{\beta }/(N\cdot n), \quad \lambda _{\sigma ,\delta }=0. \end{aligned}$$
(18)
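The values (18) are easily tabulated; the following small helper is an illustrative sketch (the mode labels are ours, not the paper's notation):

```python
def stochasticity(mode, N, n, m_beta=None):
    """Stochasticity lambda = E_rand / E for the modes listed in (18);
    N = |X|, n = |Theta|, m_beta = block size (beta-modes only)."""
    table = {
        ("alpha", "gamma"): 1.0,
        ("alpha", "delta"): 1.0 / n,
        ("beta",  "gamma"): m_beta / N if m_beta else None,
        ("beta",  "delta"): m_beta / (N * n) if m_beta else None,
        ("sigma", "delta"): 0.0,
    }
    return table[mode]

# e.g. 10^6 cells, two substitutions, block radius 1 (m_beta = 9):
print(stochasticity(("beta", "gamma"), N=10**6, n=2, m_beta=9))  # -> 9e-06
```

Such small values of \(\lambda _{\beta ,\gamma }\) for realistic block sizes anticipate the trade-off analyzed in Sect. 3.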

3 Parallel implementation of complex CA

3.1 Parallelization efficiency via stochasticity

Implementation of large-scale cellular automata models on multiprocessor clusters is based on the domain decomposition method, which consists in dividing the cellular array \(\varOmega \) into P parts, referred to as subdomains, \(\varOmega =\omega _0 \cup \cdots \cup \omega _{P-1}\), and allocating them to P processors working in parallel. In order to provide interprocessor data exchange, the subdomains have to be supplemented by V peripheral cells, called shadow cells,

$$\begin{aligned} V=2D\cdot R_{\varSigma }\cdot H^{D-1}, \end{aligned}$$
(19)

where \(D \in \{1,2,3\}\) is the domain dimension, \(R_{\varSigma }\) is the united template radius (14), and H is the linear size of the domain.

Let us assess the weak parallelization efficiency as follows:

$$\begin{aligned} \eta _{\rho ,\mu }= \frac{\mathrm{Time}_1(\omega )}{\mathrm{Time}_P(\varOmega )}=\frac{\mathrm{Time}_1(\omega )}{\mathrm{Time}_1(\omega ) + l _{\rho ,\mu }\cdot \mathrm{Time}_\mathrm{ex}}, \end{aligned}$$
(20)

where \(\mathrm{Time}_1(\omega )\) is the time needed for executing one iteration on a single subdomain, \(\mathrm{Time}_P(\varOmega )\) is the time for doing the same in parallel on P subdomains using P processors, \(\mathrm{Time}_\mathrm{ex}\) is the time needed for a bidirectional transmission of data from one processor to another, and \(l_{\rho ,\mu }\) is the number of data exchanges. From (20) it is clear that \(\eta _{\rho ,\mu }\) is related to the model stochasticity through \(l_{\rho ,\mu }\), which for the different global operator modes is as follows:

$$\begin{aligned} l_{\alpha ,\gamma } =V \cdot n, \quad l_{\alpha ,\delta }=V, \quad l_{\beta ,\gamma } =m\cdot n,\quad l_{\beta ,\delta } =m, \quad l_{\sigma ,\delta }=n. \end{aligned}$$
(21)

From (21) it follows that the parallelization efficiency in \(\alpha \)-mode is unacceptably low, while \(\sigma \)-mode is perfect but often inadequate to the phenomenon under simulation; hence, \(\beta \)-mode is to be studied. Taking into account (18), the number of interprocessor exchanges in \(\beta \)-mode yields:

$$\begin{aligned} l_{\beta ,\gamma } = \lambda _{\beta ,\gamma } \cdot N\cdot n, \quad l_{\beta ,\delta }= \lambda _{\beta ,\delta }\cdot N\cdot n. \end{aligned}$$
(22)

In (20), \(\mathrm{Time}_1(\omega )\) and \(\mathrm{Time}_\mathrm{ex}\) depend on the subdomain size as follows:

$$\begin{aligned} \mathrm{Time}_1(\omega )= \tau _\mathrm{op} \cdot H^D, \quad \mathrm{Time}_\mathrm{ex}= t_\mathrm{lat} +V\cdot \tau _\mathrm{ex}, \end{aligned}$$
(23)

where \(\tau _\mathrm{op}\) is the mean time spent to process a cell during an iteration, \(t_\mathrm{lat}\) is the latency time of the data exchange process, and \(\tau _\mathrm{ex}\) is the time of bidirectional transmission of one byte between two processors.
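A rough numeric sketch of (19), (20) and (23); the parameter values in the usage lines are illustrative assumptions, not measurements from the paper:

```python
def efficiency(H, D, R_sigma, l, tau_op, tau_ex, t_lat=0.0):
    """Weak parallelization efficiency (20), with Time_1 and Time_ex
    modeled by (23) and the shadow-cell count V by (19);
    l is the number of data exchanges per iteration, taken from (21)."""
    V = 2 * D * R_sigma * H ** (D - 1)   # shadow cells, formula (19)
    time_1 = tau_op * H ** D             # one iteration on one subdomain
    time_ex = t_lat + V * tau_ex         # one bidirectional exchange
    return time_1 / (time_1 + l * time_ex)

# sigma-mode (l = n) versus alpha-mode (l = V * n) on a 2D subdomain:
H, D, R_sigma, n = 1000, 2, 1, 2
V = 2 * D * R_sigma * H ** (D - 1)
print(efficiency(H, D, R_sigma, l=n, tau_op=1.0, tau_ex=1.0))      # near 1
print(efficiency(H, D, R_sigma, l=V * n, tau_op=1.0, tau_ex=1.0))  # far below 1
```

The two printed values illustrate numerically why \(\sigma \)-mode parallelizes well while \(\alpha \)-mode does not.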

Taking into account that for most practical cases \(V \cdot \tau _\mathrm{ex} \gg t_\mathrm{lat}\), and substituting (23) and (19) into (20), yields a relation between stochasticity and parallelization efficiency, which is more convenient when expressed through the block radius \(R_{\beta }\):

$$\begin{aligned} \eta _{\beta ,\gamma }= \frac{1}{1+ \lambda _{\beta ,\gamma }\cdot N\cdot n\cdot \frac{V\cdot \tau _\mathrm{ex}}{\mathrm{Time}_1(\omega )}}= \frac{1}{1+ C\cdot n\cdot \frac{(2R_{\beta }+1)^D}{H}}, \end{aligned}$$
(24)

where \(\nu = \tau _\mathrm{op}/\tau _\mathrm{ex}\) and \(C= 2D\cdot R_{\varSigma }/\nu \).

To give quantitative insight, Fig. 1 shows the dependencies \(\eta _{\beta ,\gamma }(\lambda _{\beta ,\gamma })\) with \(\nu \approx 1\), for \(N=1000^D\) (3D case) and \(N=10000^D\) (2D case), values typical for stochastic RD ACA models.

Analysis of the above relations shows that parallelization efficiency may be achieved only at the expense of stochasticity, the block size \(m_{\beta }=(2R_{\beta }+1)^D\) being the indicator of the balance between stochasticity and parallelization efficiency.

It is important to notice that only small initial parts of the curves are of interest, corresponding to \(0<R_{\beta }<20\), which comprises \(0<\lambda _{\beta ,\gamma }<10^{-3}\) or \(0<\lambda _{\beta ,\gamma }<5\cdot 10^{-5}\) for the 2D case and the 3D case, respectively. And yet, within this interval, \(R_{\beta }>3\) does not provide acceptable parallelization efficiency (Fig. 1).

Fig. 1

Dependence of parallelization efficiency and stochasticity on the radius of the block for block-synchronous stochastic mode of operation with \(H=1000\), \(\tau _\mathrm{op}=\tau _\mathrm{ex}\): a 2D-case, b 3D-case

3.2 Simulation results

The first mathematical model of wave front propagation was studied in [20, 21]. In [20] the process was called “diffusion combined with an increase of substance”. The results turned out to be extremely fruitful not only from the point of view of nonlinear mathematical analysis, but also for modeling a wide class of processes, such as fire propagation, spreading of epidemics and weeds, and the expansion of chemical reactions in active media. For contemporary computer simulation, the process of front propagation may be considered a typical component of more complex phenomena [22, 23]. So, its CA model deserves to be investigated in detail.

The kinetics of front propagation is regarded in a discrete space, a part of which is occupied by a cloud of particles (Fig. 2). The particles diffuse from the place of high density, where \(v(\mathbf {x})=1\), towards the free space, where \(v(\mathbf {x})=0\), the displacement being accompanied by a reaction of the medium, which is described by an nth-order polynomial function of the substance density. In the continuous models [20, 21] \(n=2\), and the function is \(F(v)= c\cdot v\cdot (1- v)\), \( v \in [0,1]\). In a CA \(\aleph =\langle A,X,\varTheta (X)\rangle \) with Boolean alphabet \(A= \{0,1\}\), \(u\in \{0,1\},\) the substance density is simulated by a real number \(\langle u(\mathbf {x}) \rangle \), equal to the averaged value

$$\begin{aligned} \langle u(\mathbf {x}) \rangle =\frac{ \large \sum \nolimits _{\mathbf {x}_j \in Av(\mathbf {x})}{u(\mathbf {x}_j)}}{|Av(\mathbf {x})|}, \end{aligned}$$
(25)

where \(Av(\mathbf {x})\) is the averaging area containing a set of cells, whose distance from \(\mathbf {x}\) is less than a certain value.
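A direct Python transcription of (25); the square (Chebyshev-distance) shape of the averaging area and the periodic boundaries are assumptions of this sketch, since the text only requires cells within a certain distance of \(\mathbf {x}\):

```python
def averaged_density(grid, i, j, radius):
    """Averaged value <u(x)> of (25): the mean of the Boolean states
    over the averaging area Av(x), taken here as all cells within
    Chebyshev distance 'radius' of (i, j), with periodic boundaries."""
    I, J = len(grid), len(grid[0])
    total = count = 0
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            total += grid[(i + di) % I][(j + dj) % J]
            count += 1
    return total / count

grid = [[1, 1, 0, 0]] * 4       # left half of a small array occupied
print(averaged_density(grid, 0, 0, 1))
```

The averaged value lies in [0, 1] and plays the role of the continuous density v in the reaction term.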

Fig. 2

Evolution of the wave propagation process implemented on 64 processors

Fig. 3

Dependencies of parallelization efficiency on the block radius for \(\aleph _{\beta ,\gamma }\) with two subdomain sizes: \(N=935\times 935\) and \(N=4275\times 4275\)

The global operator is \(\varTheta (X)=\varPhi _{\beta ,\gamma }(\theta _d(i,j ),\theta _r(i,j))\), where \(\theta _d(i,j)\) is the naive diffusion substitution (6) and \(\theta _r(i,j)\) simulates the reaction:

$$\begin{aligned} \theta _r(i,j): u \mathop {\longrightarrow }\limits ^{p_r}u^\prime , \quad u^\prime = \left\{ \begin{array}{ll} 1, &{} \quad \text{ if } \quad r < c \cdot \langle u \rangle \cdot (1- \langle u \rangle ),\\ 0 &{}\quad \text{ otherwise } , \end{array} \right. \end{aligned}$$
(26)

where r is a random number, \(0 < r \le 1\), and c is a constant, further taken as 0.5; the probabilities of diffusion and reaction are \(p_d=0.9\) and \(p_r= 0.1\).
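A sketch of (26) as a Python function; leaving the state unchanged when the substitution does not fire is an assumption of the sketch, and Python's RNG yields \(0 \le r < 1\) rather than \(0 < r \le 1\), which does not affect the logic:

```python
import random

def theta_r(u, avg_u, c=0.5, p_r=0.1, rng=random):
    """Reaction substitution (26): applied with probability p_r, it sets
    the cell state to 1 if a random number r falls below
    c * <u> * (1 - <u>), and to 0 otherwise; when the substitution does
    not fire, the state u is left unchanged (an assumption here)."""
    if rng.random() < p_r:       # theta_r fires with probability p_r
        r = rng.random()         # random number 0 <= r < 1
        return 1 if r < c * avg_u * (1.0 - avg_u) else 0
    return u
```

Note that the reaction term \(c\cdot \langle u \rangle \cdot (1-\langle u \rangle )\) vanishes at \(\langle u \rangle = 0\) and \(\langle u \rangle = 1\), so new particles are born only at the front, where the averaged density is intermediate.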

Simulation of the process was performed for the ACA with the block-synchronous stochastic mode of operation on 64 processors. All experiments aimed at obtaining the dependencies of computational properties on the stochasticity \(\lambda \), which, for convenience, is given by the corresponding values of the block radius, varying from \(R_{\beta }=1\) to \(R_{\beta }=10\).

The first simulation experiment was to verify the theoretical formula (24) in two respects: the character of the dependence \(\eta (R_{\beta }) \) and the rise of the efficiency curve with the increase of the domain size (Fig. 3).

The second experiment aimed to obtain the times \(\mathrm{Time}_1(\omega )\) and \(\mathrm{Time}_{64}(\varOmega )\) for these two cases. In Fig. 4 it is seen that \(\mathrm{Time}_1(\omega )\) is practically constant, slightly increasing due to the increase of RNG calls.

Fig. 4

Dependencies of the times \(\mathrm{Time}_1(\omega )\) and \(\mathrm{Time}_{64}(\varOmega )\) on the block radius for \(\aleph _{\beta ,\gamma }\) with two subdomain sizes: \(N=935\times 935\) and \(N=4275\times 4275\)

4 Conclusion

When simulating a large-scale complex phenomenon, thought of in terms of several types of particles interacting at different rates asynchronously in discrete space, the most appropriate model seems to be one with a stochastic global mode of operation. But the problem of allocating the task on a computer cluster changes the preference towards the block-synchronous mode, where the choice of the block size m is crucial for achieving a proper balance between the process stochasticity and the parallelization efficiency. The responsibility for a proper choice of m lies with the model designer, who has to assess the role of stochasticity in the mathematical representation of the phenomenon and make a decision.

If the computation time is the most important parameter, then m, and consequently the stochasticity, should be taken as small as possible, i.e. \(m_\mathrm{min}=|T^{\prime }_{\varSigma }|\), its impact on the performance being twofold: increasing the parallelization efficiency (20) and decreasing the stochasticity of the model.